Chapter 4. Configuring persistent storage and inferencing the model


Configure persistent storage for AI Inference Server so that the model files are available to the cluster before you serve the model for inference.

Note

Configuring persistent storage is an optional but recommended step.

Prerequisites

  • You have installed a mirror registry on the bastion host.
  • You have installed the Node Feature Discovery Operator and NVIDIA GPU Operator in the disconnected cluster.

Procedure

  1. In the disconnected OpenShift Container Platform cluster, configure persistent storage using Network File System (NFS).
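
    For example, the following is a minimal sketch of an NFS-backed PersistentVolume and PersistentVolumeClaim. The NFS server address (nfs.example.com), export path (/exports/models), and 50Gi capacity are placeholder assumptions; adjust them for your environment. The claim name matches the claimName that the Deployment in the next step references.

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: granite-31-w8a8-pv
    spec:
      capacity:
        storage: 50Gi
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      nfs:
        server: nfs.example.com   # placeholder NFS server
        path: /exports/models     # placeholder export path
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: granite-31-w8a8
      namespace: rhaiis-namespace
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: ""
      resources:
        requests:
          storage: 50Gi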
  2. Create a Deployment custom resource (CR). For example, the following Deployment CR uses AI Inference Server to serve a Granite model on a CUDA accelerator.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: granite
      namespace: rhaiis-namespace 1
      labels:
        app: granite
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: granite
      template:
        metadata:
          labels:
            app: granite
        spec:
          containers:
            - name: granite
              image: 'registry.redhat.io/rhaiis/vllm-cuda-rhel9@sha256:137ac606b87679c90658985ef1fc9a26a97bb11f622b988fe5125f33e6f35d78'
              imagePullPolicy: IfNotPresent
              command:
                - python
                - '-m'
                - vllm.entrypoints.openai.api_server
              args:
                - '--port=8000'
                - '--model=/mnt/models'
                - '--served-model-name=granite-3.1-2b-instruct-quantized.w8a8'
                - '--tensor-parallel-size=1'
              resources:
                limits:
                  cpu: '10'
                  nvidia.com/gpu: '1'
                requests:
                  cpu: '2'
                  memory: 6Gi
                  nvidia.com/gpu: '1'
              volumeMounts:
                - name: cache-volume
                  mountPath: /mnt/models
                - name: shm
                  mountPath: /dev/shm 2
          volumes:
            - name: cache-volume
              persistentVolumeClaim:
                claimName: granite-31-w8a8
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 2Gi
          restartPolicy: Always
    1 The metadata.namespace value must match the namespace where you configure the Hugging Face Secret CR.
    2 The /dev/shm volume mount is required by the NVIDIA Collective Communications Library (NCCL). Tensor parallel vLLM deployments fail when the /dev/shm volume mount is not set.
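
    After you create the Deployment, you can confirm that the pod is running and that the vLLM server has started. The file name granite-deployment.yaml is an assumed name for the Deployment manifest above:

    $ oc apply -f granite-deployment.yaml
    $ oc get pods -n rhaiis-namespace -l app=granite
    $ oc logs deployment/granite -n rhaiis-namespace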
  3. Create a Service CR for the model inference. For example:

    apiVersion: v1
    kind: Service
    metadata:
      name: granite
      namespace: rhaiis-namespace
    spec:
      selector:
        app: granite
      ports:
        - protocol: TCP
          port: 80
          targetPort: 8000
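
    You can verify that the Service selects the model pod by checking its endpoints. Port 80 on the Service forwards to port 8000 on the container. The file name granite-service.yaml is an assumed name for the Service manifest above:

    $ oc apply -f granite-service.yaml
    $ oc get endpoints granite -n rhaiis-namespace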
  4. Optional. Create a Route CR to enable public access to the model. For example:

    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      name: granite
      namespace: rhaiis-namespace
    spec:
      to:
        kind: Service
        name: granite
      port:
        targetPort: 80
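
    If you do not create a Route, the model remains reachable inside the cluster through the Service's cluster-internal DNS name. For example, from another pod in the cluster:

    $ curl http://granite.rhaiis-namespace.svc.cluster.local/v1/models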
  5. Get the URL for the exposed route:

    $ oc get route granite -n rhaiis-namespace -o jsonpath='{.spec.host}'

    Example output

    granite-rhaiis-namespace.apps.example.com

  6. Query the model by running the following command:

    $ curl -X POST http://granite-rhaiis-namespace.apps.example.com/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "granite-3.1-2b-instruct-quantized.w8a8",
        "messages": [{"role": "user", "content": "What is AI?"}],
        "temperature": 0.1
      }'
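
    You can also confirm the name of the served model by querying the OpenAI-compatible models endpoint that vLLM exposes, using the route host from the previous step:

    $ curl http://granite-rhaiis-namespace.apps.example.com/v1/models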