Deploying Red Hat AI Inference Server in a disconnected environment


Red Hat AI Inference Server 3.3

Deploy Red Hat AI Inference Server in a disconnected environment using OpenShift Container Platform and a disconnected mirror image registry

Red Hat AI Documentation Team

Abstract

Learn how to work with Red Hat AI Inference Server for model serving and inferencing in a disconnected environment.

Preface

You can deploy Red Hat AI Inference Server in a disconnected OpenShift Container Platform environment that does not have direct access to the internet by mirroring Operator and OCI model container images to a local mirror registry and configuring the cluster to use the mirrored images.

Important

Currently, only NVIDIA CUDA AI accelerators are supported for OpenShift Container Platform in disconnected environments.

After mirroring the required images, you can install the Node Feature Discovery Operator and NVIDIA GPU Operator from the mirrored sources, and then deploy Red Hat AI Inference Server to serve the OCI-compliant model for inference.

You can store language models in disconnected environments using OCI-compliant model container images or persistent storage volumes. Each approach has different tradeoffs for deployment complexity, storage efficiency, and operational workflows.

Important

Use OCI model images when you want to use Red Hat validated models and prefer a unified container-based workflow.

Use persistent storage when you need to deploy custom models, fine-tuned models, or models not available as OCI images.

OCI model container images

OCI-compliant model container images, also known as modelcars, package language models as container images that you can store in your mirror registry alongside other container images. This approach integrates with existing container image workflows and infrastructure:

  • Uses the same mirroring workflow as other container images
  • Leverages existing container registry infrastructure for versioning and distribution
  • Enables faster pod startup through image caching on nodes
  • Simplifies model updates through standard image pull mechanisms
Note

OCI model container images require additional registry storage capacity for large model images. Model images can be 10-100 GB depending on model size and applied quantization.

Red Hat provides validated OCI model images in the registry.redhat.io/rhelai1 namespace that you can mirror to your disconnected registry.
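For example, before mirroring, you can inspect a validated model image with skopeo to check its digest and layer sizes. This is a minimal sketch that assumes you have skopeo installed and are logged in to registry.redhat.io:

$ skopeo inspect docker://registry.redhat.io/rhelai1/granite-3-1-8b-instruct-quantized-w8a8:1.5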

Persistent storage volumes

You can store model files directly on persistent storage, such as Network File System (NFS) volumes or other persistent volume types that OpenShift Container Platform supports. This approach requires transferring model files to the disconnected environment separately from container images. By mounting the same persistent storage volume, you can share a single copy of a model across multiple inference pods. You can store models downloaded from Hugging Face or other sources, as well as custom or fine-tuned models that are not available as OCI images.

Persistent storage volumes require a separate transfer and setup workflow for model files, with appropriate storage provisioning and access configuration.
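For example, the following PersistentVolumeClaim is a minimal sketch for holding model files. The claim name, size, and storage class are placeholders; the ReadWriteMany access mode lets multiple inference pods mount the same model copy, provided that your storage backend supports it:

oc apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: <STORAGE_CLASS_NAME>
EOF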

To serve container images in a disconnected environment, you must configure a disconnected mirror registry on a bastion host. The bastion host acts as a secure gateway between your disconnected environment and the internet. You then mirror images from Red Hat’s online image registries, and serve them in the disconnected environment.
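If you do not already have a registry available, the Red Hat mirror-registry tool can deploy a small-scale Quay registry on the bastion host. The following invocation is an example only; the hostname and installation directory depend on your environment:

$ ./mirror-registry install --quayHostname bastion.example.com --quayRoot /opt/quay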

Chapter 3. Mirroring the required images

Once you have created a mirror registry for the disconnected environment, you are ready to mirror the required AI Inference Server image, AI accelerator Operator images, and OCI model container image.

Prerequisites

  • You have installed the OpenShift CLI (oc).
  • You have logged in as a user with cluster-admin privileges.
  • You have installed a mirror registry on the bastion host.

Procedure

  1. Find the versions of the AI Inference Server image, the NFD Operator, the NVIDIA GPU Operator, and the OCI model container image that match your environment and model inference use case.

  2. Select an OCI model container image, for example, registry.redhat.io/rhelai1/granite-3-1-8b-instruct-quantized-w8a8:1.5.

    Note

    You can select any OCI model container image from the validated models list that matches your requirements. See Validated models for AI Inference Server for available options.

  3. Create an image set configuration custom resource (CR) that includes the NFD Operator, NVIDIA GPU Operator, AI Inference Server image, and the OCI model image. For example, save the following ImageSetConfiguration CR as the file imageset-config.yaml:

    apiVersion: mirror.openshift.io/v2alpha1
    kind: ImageSetConfiguration
    mirror:
      operators:
      # Node Feature Discovery (NFD) Operator
      # Helps OpenShift detect hardware capabilities like GPUs
      - catalog: registry.redhat.io/openshift4/ose-cluster-nfd-operator:latest
        packages:
          - name: nfd
            defaultChannel: stable
            channels:
              - name: stable
    
      # GPU Operator
      # Manages NVIDIA GPUs on OpenShift
      - catalog: registry.connect.redhat.com/nvidia/gpu-operator-bundle:latest
        packages:
          - name: gpu-operator-certified
            defaultChannel: stable
            channels:
              - name: stable
      additionalImages:
      # Red Hat AI Inference Server image
      - name: registry.redhat.io/rhaiis/vllm-cuda-rhel9:latest
      # Model image
      - name: registry.redhat.io/rhelai1/granite-3-1-8b-instruct-quantized-w8a8:1.5
  4. Mirror the required images into the mirror registry using a valid pull secret. Run the following command:

    $ oc mirror --config imageset-config.yaml docker://<TARGET_MIRROR_REGISTRY_URL> --registry-config <PATH_TO_PULL_SECRET_JSON>
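    To spot-check the mirrored content, you can query the catalog endpoint of the registry. This assumes that your mirror registry exposes the standard Docker Registry v2 API and that you substitute valid credentials:

    $ curl -u <REGISTRY_USER>:<REGISTRY_PASSWORD> https://<TARGET_MIRROR_REGISTRY_URL>/v2/_catalog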
  5. Alternatively, if you have already installed the NFD and NVIDIA GPU Operators in the cluster, create an ImageSetConfiguration CR that configures the AI Inference Server and OCI model container images only:

    apiVersion: mirror.openshift.io/v2alpha1
    kind: ImageSetConfiguration
    mirror:
      additionalImages:
      - name: registry.redhat.io/rhaiis/vllm-cuda-rhel9:latest
      - name: registry.redhat.io/rhelai1/granite-3-1-8b-instruct-quantized-w8a8:1.5
  6. Mirror the image set in the disconnected environment.
  7. Configure the cluster for the mirror registry, as shown in the example below.
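For example, an ImageDigestMirrorSet CR redirects image pulls by digest from the source registries to your mirror. The following is a minimal sketch that assumes your mirror preserves the source repository paths; oc-mirror can also generate equivalent cluster configuration resources for you. For pulls by tag, the analogous ImageTagMirrorSet resource exists:

oc apply -f - <<EOF
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: rhaiis-mirror
spec:
  imageDigestMirrors:
    - source: registry.redhat.io/rhaiis
      mirrors:
        - <TARGET_MIRROR_REGISTRY_URL>/rhaiis
    - source: registry.redhat.io/rhelai1
      mirrors:
        - <TARGET_MIRROR_REGISTRY_URL>/rhelai1
EOF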

Chapter 4. Installing the Node Feature Discovery Operator

Install the Node Feature Discovery (NFD) Operator so that the cluster can detect and use the AI accelerators that are available on cluster nodes.

Prerequisites

  • You have installed the OpenShift CLI (oc).
  • You have logged in as a user with cluster-admin privileges.

Procedure

  1. Create the Namespace CR for the Node Feature Discovery Operator:

    oc apply -f - <<EOF
    apiVersion: v1
    kind: Namespace
    metadata:
      name: openshift-nfd
      labels:
        name: openshift-nfd
        openshift.io/cluster-monitoring: "true"
    EOF
  2. Create the OperatorGroup CR:

    oc apply -f - <<EOF
    apiVersion: operators.coreos.com/v1
    kind: OperatorGroup
    metadata:
      generateName: openshift-nfd-
      name: openshift-nfd
      namespace: openshift-nfd
    spec:
      targetNamespaces:
      - openshift-nfd
    EOF
  3. Create the Subscription CR:

    oc apply -f - <<EOF
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: nfd
      namespace: openshift-nfd
    spec:
      channel: "stable"
      installPlanApproval: Automatic
      name: nfd
      source: redhat-operators
      sourceNamespace: openshift-marketplace
    EOF
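  4. Optional: Create a NodeFeatureDiscovery CR instance so that the Operator deploys the operands that label nodes with detected hardware features. The following is a minimal sketch; an empty spec relies on the Operator defaults, and a disconnected environment might require you to set an explicit operand image from your mirror registry:

    oc apply -f - <<EOF
    apiVersion: nfd.openshift.io/v1
    kind: NodeFeatureDiscovery
    metadata:
      name: nfd-instance
      namespace: openshift-nfd
    spec: {}
    EOF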

Verification

Verify that the Node Feature Discovery Operator deployment is successful by running the following command:

$ oc get pods -n openshift-nfd

Example output

NAME                                      READY   STATUS    RESTARTS   AGE
nfd-controller-manager-7f86ccfb58-vgr4x   2/2     Running   0          10m

Chapter 5. Installing the NVIDIA GPU Operator

Install the NVIDIA GPU Operator to use the underlying NVIDIA CUDA AI accelerators that are available in the cluster.

Prerequisites

  • You have installed the OpenShift CLI (oc).
  • You have logged in as a user with cluster-admin privileges.
  • You have installed the Node Feature Discovery Operator.

Procedure

  1. Create the Namespace CR for the NVIDIA GPU Operator:

    oc apply -f - <<EOF
    apiVersion: v1
    kind: Namespace
    metadata:
      name: nvidia-gpu-operator
    EOF
  2. Create the OperatorGroup CR:

    oc apply -f - <<EOF
    apiVersion: operators.coreos.com/v1
    kind: OperatorGroup
    metadata:
      name: gpu-operator-certified
      namespace: nvidia-gpu-operator
    spec:
     targetNamespaces:
     - nvidia-gpu-operator
    EOF
  3. Create the Subscription CR:

    oc apply -f - <<EOF
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: gpu-operator-certified
      namespace: nvidia-gpu-operator
    spec:
      channel: "stable"
      installPlanApproval: Manual
      name: gpu-operator-certified
      source: certified-operators
      sourceNamespace: openshift-marketplace
    EOF
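  4. Because the Subscription sets installPlanApproval: Manual, approve the generated install plan so that the installation proceeds. Look up the plan name first; the name below is a placeholder:

    $ oc get installplan -n nvidia-gpu-operator
    $ oc patch installplan <INSTALL_PLAN_NAME> -n nvidia-gpu-operator --type merge --patch '{"spec":{"approved":true}}'
  5. Create a ClusterPolicy CR so that the Operator deploys the driver and device plugin components. One common approach, sketched here under the assumption that the jq tool is available, extracts the default ClusterPolicy from the alm-examples annotation of the installed CSV:

    $ oc get csv -n nvidia-gpu-operator -o jsonpath='{.items[0].metadata.annotations.alm-examples}' | jq '.[0]' > clusterpolicy.json
    $ oc apply -f clusterpolicy.json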

Verification

Verify that the NVIDIA GPU Operator deployment is successful by running the following command:

$ oc get pods -n nvidia-gpu-operator

Example output

NAME                                                  READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-c2rfm                           1/1     Running    0          6m28s
gpu-operator-84b7f5bcb9-vqds7                         1/1     Running    0          39m
nvidia-container-toolkit-daemonset-pgcrf              1/1     Running    0          6m28s
nvidia-cuda-validator-p8gv2                           0/1     Completed  0          99s
nvidia-dcgm-exporter-kv6k8                            1/1     Running    0          6m28s
nvidia-dcgm-tpsps                                     1/1     Running    0          6m28s
nvidia-device-plugin-daemonset-gbn55                  1/1     Running    0          6m28s
nvidia-device-plugin-validator-z7ltr                  0/1     Completed  0          82s
nvidia-driver-daemonset-410.84.202203290245-0-xxgdv   2/2     Running    0          6m28s
nvidia-node-status-exporter-snmsm                     1/1     Running    0          6m28s
nvidia-operator-validator-6pfk6                       1/1     Running    0          6m28s

Chapter 6. Deploying Red Hat AI Inference Server in a disconnected environment

Deploy Red Hat AI Inference Server in a disconnected OpenShift Container Platform environment to inference serve large language models without any connection to the outside internet. The deployment relies on the mirrored container image registry that you configured for the disconnected environment.

Important

Currently, only NVIDIA CUDA AI accelerators are supported for OpenShift Container Platform in disconnected environments.

Note

This procedure uses OCI model images mirrored to your disconnected registry. Alternatively, you can download model files from Hugging Face, transfer them to persistent storage in your disconnected cluster, and mount the storage in your deployment.

Disconnected deployments require setting up a mirror registry to host container images and operator catalogs that would normally be pulled from internet-accessible registries. After mirroring the required images, you can install the Node Feature Discovery Operator and NVIDIA GPU Operator from the mirrored sources, then deploy Red Hat AI Inference Server for inference serving.

Prerequisites

  • You have installed a mirror registry on the bastion host that is accessible to the disconnected cluster.
  • You have mirrored the Red Hat AI Inference Server image and OCI model images to your mirror registry.
  • You have installed the Node Feature Discovery Operator and NVIDIA GPU Operator in the disconnected cluster.

Procedure

  1. Create a namespace for the AI Inference Server deployment:

    $ oc create namespace rhaiis-namespace
  2. Create the Deployment CR using an init container to load the model from the mirrored OCI image:

    oc apply -f - <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: granite
      namespace: rhaiis-namespace
      labels:
        app: granite
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: granite
      template:
        metadata:
          labels:
            app: granite
        spec:
          initContainers:
            - name: model-loader
              image: '<MIRROR_REGISTRY_URL>/rhelai1/granite-3-1-8b-instruct-quantized-w8a8:1.5'
              command: ['cp', '-r', '/models/.', '/mnt/models/']
              volumeMounts:
                - name: model-volume
                  mountPath: /mnt/models
          containers:
            - name: granite
              image: '<MIRROR_REGISTRY_URL>/rhaiis/vllm-cuda-rhel9:latest'
              imagePullPolicy: IfNotPresent
              command:
                - python
                - '-m'
                - vllm.entrypoints.openai.api_server
              args:
                - '--port=8000'
                - '--model=/mnt/models'
                - '--served-model-name=granite-3.1-8b-instruct-quantized-w8a8'
                - '--tensor-parallel-size=1'
              resources:
                limits:
                  cpu: '10'
                  nvidia.com/gpu: '1'
                requests:
                  cpu: '2'
                  memory: 6Gi
                  nvidia.com/gpu: '1'
              volumeMounts:
                - name: model-volume
                  mountPath: /mnt/models
                - name: shm
                  mountPath: /dev/shm
          volumes:
            - name: model-volume
              emptyDir: {}
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 2Gi
          restartPolicy: Always
    EOF
    • <MIRROR_REGISTRY_URL>: Replace with the URL of your mirror registry. The init container copies model files from the OCI image to a shared volume before the inference server starts.
    • mountPath: /dev/shm: Mounts the shared memory volume required by the NVIDIA Collective Communications Library (NCCL). Tensor parallel deployments fail without this volume mount.
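    You can watch the rollout before continuing. The Deployment becomes available after the init container finishes copying the model files and the vLLM server starts:

    $ oc rollout status deployment/granite -n rhaiis-namespace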
  3. Create a Service CR for the model inference:

    oc apply -f - <<EOF
    apiVersion: v1
    kind: Service
    metadata:
      name: granite
      namespace: rhaiis-namespace
    spec:
      selector:
        app: granite
      ports:
        - protocol: TCP
          port: 80
          targetPort: 8000
    EOF
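    If you do not create the optional Route in the next step, you can still test the model from a workstation that has cluster access by forwarding a local port to the Service, and then sending requests to http://localhost:8000:

    $ oc port-forward service/granite 8000:80 -n rhaiis-namespace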
  4. Optional: Create a Route CR to enable access to the model from outside the cluster:

    oc apply -f - <<EOF
    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      name: granite
      namespace: rhaiis-namespace
    spec:
      to:
        kind: Service
        name: granite
      port:
        targetPort: 80
    EOF

Verification

  1. Get the URL for the exposed route:

    $ oc get route granite -n rhaiis-namespace -o jsonpath='{.spec.host}'

    Example output

    granite-rhaiis-namespace.apps.example.com

  2. Query the model to verify the deployment:

    $ curl -X POST http://granite-rhaiis-namespace.apps.example.com/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "granite-3.1-8b-instruct-quantized-w8a8",
        "messages": [{"role": "user", "content": "What is AI?"}],
        "temperature": 0.1
      }'

    The model returns an answer in a valid JSON response.
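  3. Optional: List the models that the server exposes. The vLLM server provides an OpenAI-compatible models endpoint, and the response includes the name set with --served-model-name:

    $ curl http://granite-rhaiis-namespace.apps.example.com/v1/models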

Legal Notice

Copyright © Red Hat.
Except as otherwise noted below, the text of and illustrations in this documentation are licensed by Red Hat under the Creative Commons Attribution-Share Alike 3.0 Unported license (CC-BY-SA). If you distribute this document or an adaptation of it, you must provide the URL for the original version.
Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.
Red Hat, the Red Hat logo, JBoss, Hibernate, and RHCE are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the United States and other countries.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
XFS is a trademark or registered trademark of Hewlett Packard Enterprise Development LP or its subsidiaries in the United States and other countries.
The OpenStack® Word Mark and OpenStack logo are trademarks or registered trademarks of the Linux Foundation, used under license.
All other trademarks are the property of their respective owners.