Chapter 5. Serving the model for inference in a disconnected environment


Use Red Hat AI Inference Server deployed in a disconnected OpenShift Container Platform environment to serve the language model for inference from cluster persistent storage.

Prerequisites

  • You have installed a mirror registry on the bastion host that is accessible to the disconnected cluster.
  • You have added the model and Red Hat AI Inference Server images to the mirror registry.
  • You have installed the Node Feature Discovery Operator and NVIDIA GPU Operator in the disconnected cluster.

Procedure

  1. In the disconnected cluster, configure persistent storage using Network File System (NFS) and make the model available in the persistent storage that you configure.

    Note

    For more information, see Persistent storage using NFS.
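
    For example, a minimal sketch of a statically provisioned PersistentVolume and PersistentVolumeClaim backed by NFS might look like the following. The NFS server address, export path, and storage size are placeholder assumptions; the claim name must match the claimName that the Deployment CR in the next step references.

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: granite-31-w8a8-pv
    spec:
      capacity:
        storage: 20Gi
      accessModes:
        - ReadWriteMany
      nfs:
        server: nfs.example.com # placeholder NFS server reachable from the disconnected cluster
        path: /exports/models/granite # placeholder export path that contains the downloaded model files
      persistentVolumeReclaimPolicy: Retain
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: granite-31-w8a8
      namespace: rhaiis-namespace
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 20Gi
      storageClassName: '' # bind directly to the statically provisioned PV
      volumeName: granite-31-w8a8-pv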

  2. Create a Deployment custom resource (CR). For example, the following Deployment CR uses AI Inference Server to serve a Granite model on a CUDA accelerator.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: granite
      namespace: rhaiis-namespace
      labels:
        app: granite
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: granite
      template:
        metadata:
          labels:
            app: granite
        spec:
          containers:
            - name: granite
              image: 'registry.redhat.io/rhaiis/vllm-cuda-rhel9@sha256:137ac606b87679c90658985ef1fc9a26a97bb11f622b988fe5125f33e6f35d78'
              imagePullPolicy: IfNotPresent
              command:
                - python
                - '-m'
                - vllm.entrypoints.openai.api_server
              args:
                - '--port=8000'
                - '--model=/mnt/models' # 1
                - '--served-model-name=granite-3.1-2b-instruct-quantized.w8a8'
                - '--tensor-parallel-size=1'
              resources:
                limits:
                  cpu: '10'
                  nvidia.com/gpu: '1'
                requests:
                  cpu: '2'
                  memory: 6Gi
                  nvidia.com/gpu: '1'
              volumeMounts:
                - name: cache-volume
                  mountPath: /mnt/models
                - name: shm
                  mountPath: /dev/shm # 2
          volumes:
            - name: cache-volume
              persistentVolumeClaim:
                claimName: granite-31-w8a8
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 2Gi
          restartPolicy: Always
    1 The model that you downloaded must be available at this mount path in the configured persistent volume.
    2 The /dev/shm volume mount is required by the NVIDIA Collective Communications Library (NCCL). Tensor-parallel vLLM deployments fail when the /dev/shm volume mount is not set.
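
    After you create the Deployment CR, apply it and confirm that the server pod starts. The file name granite-deployment.yaml is an assumption for this example:

    $ oc apply -f granite-deployment.yaml
    $ oc get pods -n rhaiis-namespace -l app=granite
    $ oc logs -n rhaiis-namespace deployment/granite --follow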
  3. Create a Service CR for the model inference. For example:

    apiVersion: v1
    kind: Service
    metadata:
      name: granite
      namespace: rhaiis-namespace
    spec:
      selector:
        app: granite
      ports:
        - protocol: TCP
          port: 80
          targetPort: 8000
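    To confirm that the Service selects the running model pod, you can check its endpoints:

    $ oc get endpoints granite -n rhaiis-namespace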
  4. Optional: Create a Route CR to enable public access to the model. For example:

    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      name: granite
      namespace: rhaiis-namespace
    spec:
      to:
        kind: Service
        name: granite
      port:
        targetPort: 80
  5. Get the URL for the exposed route:

    $ oc get route granite -n rhaiis-namespace -o jsonpath='{.spec.host}'

    Example output

    granite-rhaiis-namespace.apps.example.com

  6. Query the model by running the following command:

    curl -X POST http://granite-rhaiis-namespace.apps.example.com/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "granite-3.1-2b-instruct-quantized.w8a8",
        "messages": [{"role": "user", "content": "What is AI?"}],
        "temperature": 0.1
      }'
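
    If the query fails, you can first verify that the server reports the expected model name. The /v1/models endpoint is part of the OpenAI-compatible API that vLLM exposes:

    curl http://granite-rhaiis-namespace.apps.example.com/v1/models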