Chapter 6. Inference serving a model in a disconnected environment


Use Red Hat AI Inference Server deployed in a disconnected OpenShift Container Platform environment to inference serve large language models without any connection to the public internet. To do this, install OpenShift Container Platform and configure a mirrored container image registry in the disconnected environment.

Important

Currently, only NVIDIA CUDA AI accelerators are supported for OpenShift Container Platform in disconnected environments.

Note

This procedure uses OCI model images mirrored to your disconnected registry. Alternatively, you can download model files from Hugging Face, transfer them to persistent storage in your disconnected cluster, and mount the storage in your deployment.
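As a sketch of that alternative, a PersistentVolumeClaim could hold the transferred model files and be referenced from the Deployment's `volumes` list in place of the `model-volume` emptyDir. The claim name and storage size below are illustrative assumptions, not values from this procedure:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: granite-model-pvc        # illustrative name
  namespace: rhaiis-namespace
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi              # size the claim to fit the transferred model files
```

After the claim is bound, you can copy the downloaded model files onto the volume, for example with `oc rsync` into a helper pod that mounts the claim, and then mount the claim at `/mnt/models` in the inference server pod.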

Disconnected deployments require setting up a mirror registry to host container images and operator catalogs that would normally be pulled from internet-accessible registries. After mirroring the required images, you can install the Node Feature Discovery Operator and NVIDIA GPU Operator from the mirrored sources, then deploy Red Hat AI Inference Server for inference serving.
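For example, the operator catalogs and images can be mirrored with the oc-mirror OpenShift CLI plugin. The following ImageSetConfiguration is a sketch only: the catalog index versions shown (v4.16) are assumptions and must match your OpenShift Container Platform version, and operator package names should be verified against the catalogs before mirroring:

```yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  registry:
    imageURL: <MIRROR_REGISTRY_URL>/mirror/oc-mirror-metadata
mirror:
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.16
      packages:
        - name: nfd                      # Node Feature Discovery Operator
    - catalog: registry.redhat.io/redhat/certified-operator-index:v4.16
      packages:
        - name: gpu-operator-certified   # NVIDIA GPU Operator
  additionalImages:
    - name: registry.redhat.io/rhaiis/vllm-cuda-rhel9:latest
    - name: registry.redhat.io/rhelai1/granite-3-1-8b-instruct-quantized-w8a8:1.5
```

Running `oc mirror --config=<file> docker://<MIRROR_REGISTRY_URL>` against this configuration pushes the catalogs and images to the mirror registry on the bastion host.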

Prerequisites

  • You have installed a mirror registry on the bastion host that is accessible to the disconnected cluster.
  • You have mirrored the Red Hat AI Inference Server image and OCI model images to your mirror registry.
  • You have installed the Node Feature Discovery Operator and NVIDIA GPU Operator in the disconnected cluster.

Procedure

  1. Create a namespace for the AI Inference Server deployment:

    $ oc create namespace rhaiis-namespace
  2. Create the Deployment CR using an init container to load the model from the mirrored OCI image:

    oc apply -f - <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: granite
      namespace: rhaiis-namespace
      labels:
        app: granite
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: granite
      template:
        metadata:
          labels:
            app: granite
        spec:
          initContainers:
            - name: model-loader
              image: '<MIRROR_REGISTRY_URL>/rhelai1/granite-3-1-8b-instruct-quantized-w8a8:1.5'
              command: ['cp', '-r', '/models/.', '/mnt/models/']
              volumeMounts:
                - name: model-volume
                  mountPath: /mnt/models
          containers:
            - name: granite
              image: '<MIRROR_REGISTRY_URL>/rhaiis/vllm-cuda-rhel9:latest'
              imagePullPolicy: IfNotPresent
              command:
                - python
                - '-m'
                - vllm.entrypoints.openai.api_server
              args:
                - '--port=8000'
                - '--model=/mnt/models'
                - '--served-model-name=granite-3.1-8b-instruct-quantized-w8a8'
                - '--tensor-parallel-size=1'
              resources:
                limits:
                  cpu: '10'
                  nvidia.com/gpu: '1'
                requests:
                  cpu: '2'
                  memory: 6Gi
                  nvidia.com/gpu: '1'
              volumeMounts:
                - name: model-volume
                  mountPath: /mnt/models
                - name: shm
                  mountPath: /dev/shm
          volumes:
            - name: model-volume
              emptyDir: {}
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 2Gi
          restartPolicy: Always
    EOF
    • <MIRROR_REGISTRY_URL>: Replace with the URL of your mirror registry. The init container copies model files from the OCI image to a shared volume before the inference server starts.
    • mountPath: /dev/shm: Mounts the shared memory volume required by the NVIDIA Collective Communications Library (NCCL). Tensor parallel deployments fail without this volume mount.
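If the cluster mixes GPU and CPU-only worker nodes, the pod can be pinned to GPU nodes with a nodeSelector. The label below is applied to GPU nodes by the NVIDIA GPU Operator through Node Feature Discovery; treat this as an optional, hedged addition to the pod spec rather than part of the required manifest:

```yaml
# Optional addition under spec.template.spec in the Deployment:
nodeSelector:
  nvidia.com/gpu.present: "true"   # set on GPU nodes by the GPU Operator/NFD
```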
  3. Create a Service CR for the model inference:

    oc apply -f - <<EOF
    apiVersion: v1
    kind: Service
    metadata:
      name: granite
      namespace: rhaiis-namespace
    spec:
      selector:
        app: granite
      ports:
        - protocol: TCP
          port: 80
          targetPort: 8000
    EOF
  4. Optional: Create a Route CR to enable access to the model from outside the cluster:

    oc apply -f - <<EOF
    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      name: granite
      namespace: rhaiis-namespace
    spec:
      to:
        kind: Service
        name: granite
      port:
        targetPort: 80
    EOF

Verification

  1. Get the URL for the exposed route:

    $ oc get route granite -n rhaiis-namespace -o jsonpath='{.spec.host}'

    Example output

    granite-rhaiis-namespace.apps.example.com

  2. Query the model to verify the deployment:

    $ curl -X POST http://granite-rhaiis-namespace.apps.example.com/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "granite-3.1-8b-instruct-quantized-w8a8",
        "messages": [{"role": "user", "content": "What is AI?"}],
        "temperature": 0.1
      }'

The model returns the answer in a JSON response that follows the OpenAI-compatible chat completions format.
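The same query can also be built and parsed programmatically. This Python sketch constructs the request payload that the curl command sends and extracts the generated text from the OpenAI-compatible response shape; the response content shown is a placeholder, and actually sending the request (for example with urllib.request against the route host) is omitted:

```python
import json

# Build the same chat-completions payload that the curl command sends.
payload = {
    "model": "granite-3.1-8b-instruct-quantized-w8a8",
    "messages": [{"role": "user", "content": "What is AI?"}],
    "temperature": 0.1,
}
body = json.dumps(payload)

# A trimmed sketch of the OpenAI-compatible response shape; the
# "content" text here is a placeholder, not real model output.
sample_response = json.loads("""
{
  "object": "chat.completion",
  "model": "granite-3.1-8b-instruct-quantized-w8a8",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "AI is the field of ..."},
      "finish_reason": "stop"
    }
  ]
}
""")

# The generated text is at choices[0].message.content.
answer = sample_response["choices"][0]["message"]["content"]
print(answer)
```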
