Deploying Red Hat AI Inference Server in a disconnected environment
Deploy Red Hat AI Inference Server in a disconnected environment using OpenShift Container Platform and a disconnected mirror image registry
Abstract
Preface
You can deploy Red Hat AI Inference Server in a disconnected OpenShift Container Platform environment that does not have direct access to the internet by mirroring Operator and OCI model container images to a local mirror registry and configuring the cluster to use the mirrored images.
Currently, only NVIDIA CUDA AI accelerators are supported for OpenShift Container Platform in disconnected environments.
After mirroring the required images, you can install the Node Feature Discovery Operator and NVIDIA GPU Operator from the mirrored sources, then deploy Red Hat AI Inference Server for inference serving the OCI-compliant model.
Chapter 1. Storing models in disconnected environments
You can store language models in disconnected environments using OCI-compliant model container images or persistent storage volumes. Each approach has different tradeoffs for deployment complexity, storage efficiency, and operational workflows.
Use OCI model images when you want to use Red Hat validated models and prefer a unified container-based workflow.
Use persistent storage when you need to deploy custom models, fine-tuned models, or models not available as OCI images.
- OCI model container images
OCI-compliant model container images, also known as modelcars, package language models as container images that you can store in your mirror registry alongside other container images. This approach integrates with existing container image workflows and infrastructure:
- Uses the same mirroring workflow as other container images
- Leverages existing container registry infrastructure for versioning and distribution
- Enables faster pod startup through image caching on nodes
- Simplifies model updates through standard image pull mechanisms
Note: OCI model container images require additional registry storage capacity for large model images. Model images can be 10 GB to 100 GB depending on model size and applied quantization.
Red Hat provides validated OCI model images in the registry.redhat.io/rhelai1 namespace that you can mirror to your disconnected registry.
- Persistent storage volumes
You can store model files directly with persistent storage such as Network File System (NFS) volumes or other persistent volume types supported by OpenShift Container Platform. This approach requires transferring model files to the disconnected environment separately from container images. You can share a single copy of a model across multiple inference pods with the same persistent storage volume. You can store models downloaded from Hugging Face or other sources, or you can store custom or fine-tuned models not available as OCI images.
Persistent storage volumes require a separate transfer and setup workflow for model files, with appropriate storage provisioning and access configuration.
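A persistent volume claim for shared model storage might look like the following sketch. The storage class name `nfs-storage`, the claim name, and the requested capacity are assumptions; adjust them to match the storage available in your cluster. The `ReadWriteMany` access mode lets multiple inference pods mount the same copy of the model.

```yaml
# A minimal sketch of a PersistentVolumeClaim for shared model storage.
# Assumes an RWX-capable storage class (for example NFS-backed) named
# "nfs-storage" exists in the cluster; name and size are illustrative.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
spec:
  accessModes:
    - ReadWriteMany        # allows several inference pods to share one model copy
  storageClassName: nfs-storage
  resources:
    requests:
      storage: 100Gi       # size for the model files; adjust to your model
```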
Chapter 2. Setting up a mirror registry for your disconnected environment
To serve container images in a disconnected environment, you must configure a disconnected mirror registry on a bastion host. The bastion host acts as a secure gateway between your disconnected environment and the internet. You then mirror images from Red Hat’s online image registries, and serve them in the disconnected environment.
Prerequisites
- You have deployed the bastion host.
- You have installed the oc CLI on the bastion host.
- You have installed Podman on the bastion host.
- You have installed OpenShift Container Platform in the disconnected environment.
Procedure
- Open a shell prompt on the bastion host and create the disconnected mirror registry.
- Configure credentials that allow images to be mirrored.
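The two steps above might look like the following sketch, assuming you use the mirror registry for Red Hat OpenShift installer (the `mirror-registry` CLI). The hostname, installation path, and credentials are illustrative placeholders.

```shell
# A sketch of creating the mirror registry on the bastion host with the
# "mirror-registry" installer; hostname and paths are illustrative.
./mirror-registry install \
  --quayHostname bastion.example.com \
  --quayRoot /opt/mirror-registry

# Log in with the credentials printed by the installer so that images
# can be pushed to the mirror registry.
podman login --username init \
  --password <password_generated_by_installer> \
  bastion.example.com:8443
```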
Chapter 3. Mirroring the required images for model inference
After you have created a mirror registry for the disconnected environment, you are ready to mirror the required AI Inference Server image, AI accelerator Operator images, and OCI model container image.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- You have installed a mirror registry on the bastion host.
Procedure
- Find the versions of the following images that match your environment and model inference use case:
  - Select an OCI model container image, for example registry.redhat.io/rhelai1/granite-3-1-8b-instruct-quantized-w8a8:1.5.
  Note: You can select any OCI model container image from the validated models list that matches your requirements. See Validated models for AI Inference Server for available options.
- Create an image set configuration custom resource (CR) that includes the NFD Operator, NVIDIA GPU Operator, AI Inference Server image, and the OCI model image. For example, save the ImageSetConfiguration CR as the file imageset-config.yaml.
- Mirror the required images into the mirror registry using a valid pull secret. Run the following command:

  $ oc mirror --config imageset-config.yaml docker://<TARGET_MIRROR_REGISTRY_URL> --registry-config <PATH_TO_PULL_SECRET_JSON>

  Alternatively, if you have already installed the NFD and NVIDIA GPU Operators in the cluster, create an ImageSetConfiguration CR that configures the AI Inference Server and OCI model container images only.
- Mirror the image set in the disconnected environment.
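An ImageSetConfiguration CR for this step might look like the following sketch, assuming the oc-mirror plugin v2 API (mirror.openshift.io/v2alpha1). The catalog versions, channel names, Operator package names, and image tags shown here are assumptions; verify them against your cluster version and the validated models list before mirroring.

```yaml
# A sketch of an ImageSetConfiguration that mirrors the NFD Operator,
# the NVIDIA GPU Operator, the AI Inference Server image, and one OCI
# model image. Catalog versions, package names, and tags are illustrative.
apiVersion: mirror.openshift.io/v2alpha1
kind: ImageSetConfiguration
mirror:
  operators:
    # NFD Operator ships in the Red Hat Operator catalog.
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.18
      packages:
        - name: nfd
    # NVIDIA GPU Operator ships in the certified Operator catalog.
    - catalog: registry.redhat.io/redhat/certified-operator-index:v4.18
      packages:
        - name: gpu-operator-certified
  additionalImages:
    # AI Inference Server image; repository name and tag are assumptions.
    - name: registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2
    # OCI model container image selected earlier in this procedure.
    - name: registry.redhat.io/rhelai1/granite-3-1-8b-instruct-quantized-w8a8:1.5
```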
- Configure the cluster for the mirror registry.
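Configuring the cluster for the mirror registry typically means applying mirror mappings so that pulls of the original image names are redirected to the mirror. oc mirror generates equivalent resources in its results directory; a hand-written sketch, with a placeholder mirror hostname, could look like this:

```yaml
# A sketch of an ImageDigestMirrorSet that redirects pulls from
# registry.redhat.io to the disconnected mirror registry. The repository
# namespaces and the mirror URL placeholder are illustrative.
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: rhaiis-mirror
spec:
  imageDigestMirrors:
    - source: registry.redhat.io/rhelai1
      mirrors:
        - <TARGET_MIRROR_REGISTRY_URL>/rhelai1
    - source: registry.redhat.io/rhaiis
      mirrors:
        - <TARGET_MIRROR_REGISTRY_URL>/rhaiis
```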
Chapter 4. Installing the Node Feature Discovery Operator
Install the Node Feature Discovery Operator so that the cluster can use the AI accelerators that are available in the cluster.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
Procedure
- Create the Namespace CR for the Node Feature Discovery Operator.
- Create the OperatorGroup CR.
- Create the Subscription CR.
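The three CRs above might look like the following sketch. The subscription channel name "stable" and the catalog source name are assumptions; check the catalog sources that oc mirror created in your cluster with `oc get catalogsource -n openshift-marketplace`.

```yaml
# A sketch of the Namespace, OperatorGroup, and Subscription CRs for the
# NFD Operator. Channel and catalog source names are assumptions; verify
# them against your mirrored catalogs.
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  targetNamespaces:
    - openshift-nfd
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: stable
  name: nfd
  source: redhat-operator-index     # catalog source created by oc mirror (assumption)
  sourceNamespace: openshift-marketplace
```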
Verification
Verify that the Node Feature Discovery Operator deployment is successful by running the following command:
$ oc get pods -n openshift-nfd
Example output
NAME READY STATUS RESTARTS AGE
nfd-controller-manager-7f86ccfb58-vgr4x 2/2 Running 0 10m
Chapter 5. Installing the NVIDIA GPU Operator
Install the NVIDIA GPU Operator to use the underlying NVIDIA CUDA AI accelerators that are available in the cluster.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- You have installed the Node Feature Discovery Operator.
Procedure
- Create the Namespace CR for the NVIDIA GPU Operator.
- Create the OperatorGroup CR.
- Create the Subscription CR.
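These CRs might look like the following sketch. The channel and catalog source names are assumptions; you can list the available channels with `oc get packagemanifest gpu-operator-certified -n openshift-marketplace`.

```yaml
# A sketch of the Namespace, OperatorGroup, and Subscription CRs for the
# NVIDIA GPU Operator. Channel and catalog source names are assumptions;
# verify them against your mirrored certified catalog.
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
    - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: stable                     # assumption; GPU Operator channels are often versioned
  name: gpu-operator-certified
  source: certified-operator-index    # catalog source created by oc mirror (assumption)
  sourceNamespace: openshift-marketplace
```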
Verification
Verify that the NVIDIA GPU Operator deployment is successful by running the following command:
$ oc get pods -n nvidia-gpu-operator
Example output
Chapter 6. Inference serving the model in the disconnected environment
Use Red Hat AI Inference Server deployed in a disconnected OpenShift Container Platform environment to inference serve large language models without any connection to the outside internet. You enable this by installing OpenShift Container Platform and configuring a mirrored container image registry in the disconnected environment.
Currently, only NVIDIA CUDA AI accelerators are supported for OpenShift Container Platform in disconnected environments.
This procedure uses OCI model images mirrored to your disconnected registry. Alternatively, you can download model files from Hugging Face, transfer them to persistent storage in your disconnected cluster, and mount the storage in your deployment.
Disconnected deployments require setting up a mirror registry to host container images and operator catalogs that would normally be pulled from internet-accessible registries. After mirroring the required images, you can install the Node Feature Discovery Operator and NVIDIA GPU Operator from the mirrored sources, then deploy Red Hat AI Inference Server for inference serving.
Prerequisites
- You have installed a mirror registry on the bastion host that is accessible to the disconnected cluster.
- You have mirrored the Red Hat AI Inference Server image and OCI model images to your mirror registry.
- You have installed the Node Feature Discovery Operator and NVIDIA GPU Operator in the disconnected cluster.
Procedure
Create a namespace for the AI Inference Server deployment:
  $ oc create namespace rhaiis-namespace

- Create the Deployment CR using an init container to load the model from the mirrored OCI image:
  - <MIRROR_REGISTRY_URL>: Replace with the URL of your mirror registry. The init container copies model files from the OCI image to a shared volume before the inference server starts.
  - mountPath: /dev/shm: Mounts the shared memory volume required by the NVIDIA Collective Communications Library (NCCL). Tensor parallel deployments fail without this volume mount.
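A Deployment CR following this pattern might look like the sketch below. The AI Inference Server image name and tag, the model path inside the OCI image, the server arguments, and the resource sizes are all assumptions; substitute the images you mirrored and the invocation documented for your AI Inference Server version.

```yaml
# A sketch of the Deployment: an init container copies the model out of
# the mirrored OCI model image into a shared volume, then the inference
# server container serves it. Image names, paths, and sizes are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: granite
  namespace: rhaiis-namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: granite
  template:
    metadata:
      labels:
        app: granite
    spec:
      initContainers:
        # Copies model files from the OCI model image into the shared volume
        # before the inference server starts. The /models source path inside
        # the image is an assumption.
        - name: fetch-model
          image: <MIRROR_REGISTRY_URL>/rhelai1/granite-3-1-8b-instruct-quantized-w8a8:1.5
          command: ["sh", "-c", "cp -R /models/. /mnt/models/"]
          volumeMounts:
            - name: model-volume
              mountPath: /mnt/models
      containers:
        - name: inference-server
          image: <MIRROR_REGISTRY_URL>/rhaiis/vllm-cuda-rhel9:3.2   # tag is illustrative
          args:
            - --model=/mnt/models
            - --served-model-name=granite
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-volume
              mountPath: /mnt/models
            - name: shm
              mountPath: /dev/shm     # required by NCCL; tensor parallel runs fail without it
      volumes:
        - name: model-volume
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
```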
- Create a Service CR for the model inference.
- Optional: Create a Route CR to enable access to the model from outside the cluster.
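The Service and optional Route might look like the following sketch; the names match the route name "granite" used in the verification step, and the port matches the container port assumed in the Deployment sketch.

```yaml
# A sketch of the Service and optional Route for the inference server.
# The selector and port values are assumptions that must match your Deployment.
apiVersion: v1
kind: Service
metadata:
  name: granite
  namespace: rhaiis-namespace
spec:
  selector:
    app: granite
  ports:
    - port: 8000
      targetPort: 8000
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: granite
  namespace: rhaiis-namespace
spec:
  to:
    kind: Service
    name: granite
  port:
    targetPort: 8000
```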
Verification
Get the URL for the exposed route:
$ oc get route granite -n rhaiis-namespace -o jsonpath='{.spec.host}'

Example output
granite-rhaiis-namespace.apps.example.com
Query the model to verify the deployment:
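A query might look like the following sketch, assuming the server exposes an OpenAI-compatible chat completions endpoint on the route shown above; the hostname, served model name, and prompt are illustrative.

```shell
# A sketch of a verification query against the exposed route; hostname
# and model name are illustrative placeholders.
curl -X POST http://granite-rhaiis-namespace.apps.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'
```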
The model returns an answer in a valid JSON response.