Chapter 6. Inference serving the model in the disconnected environment
Use Red Hat AI Inference Server deployed in a disconnected OpenShift Container Platform environment to inference serve large language models without any connection to the outside internet. To do this, install OpenShift Container Platform and configure a mirrored container image registry in the disconnected environment.
Currently, only NVIDIA CUDA AI accelerators are supported for OpenShift Container Platform in disconnected environments.
This procedure uses OCI model images mirrored to your disconnected registry. Alternatively, you can download model files from Hugging Face, transfer them to persistent storage in your disconnected cluster, and mount the storage in your deployment.
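For the alternative Hugging Face path, the general shape of the workflow is sketched below. The model ID, local directory, and helper pod name are hypothetical and not part of this procedure; the pod is assumed to mount the target persistent volume claim:

```shell
# On a connected host: download the model files locally (hypothetical model ID).
$ huggingface-cli download ibm-granite/granite-3.1-8b-instruct --local-dir ./granite

# After transferring the files into the disconnected environment, copy them
# into a pod that mounts the persistent volume claim (hypothetical pod name).
$ oc cp ./granite rhaiis-namespace/model-loader-pod:/mnt/models/granite
```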
Disconnected deployments require setting up a mirror registry to host container images and operator catalogs that would normally be pulled from internet-accessible registries. After mirroring the required images, you can install the Node Feature Discovery Operator and NVIDIA GPU Operator from the mirrored sources, then deploy Red Hat AI Inference Server for inference serving.
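As an illustration of the mirroring step, an oc-mirror ImageSetConfiguration can declare the operator catalogs and additional images to mirror in one file. The catalog versions, package names, and image paths below are assumptions; verify them against the Red Hat catalogs for your cluster version:

```yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  registry:
    imageURL: <MIRROR_REGISTRY_URL>/oc-mirror-metadata  # your mirror registry
mirror:
  operators:
    # NVIDIA GPU Operator (certified catalog); version tag is an assumption.
    - catalog: registry.redhat.io/redhat/certified-operator-index:v4.16
      packages:
        - name: gpu-operator-certified
    # Node Feature Discovery Operator (Red Hat catalog).
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.16
      packages:
        - name: nfd
  additionalImages:
    # Hypothetical AI Inference Server image path; use the path you pulled from.
    - name: registry.redhat.io/rhaiis/vllm-cuda-rhel9:latest
```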
Prerequisites
- You have installed a mirror registry on the bastion host that is accessible to the disconnected cluster.
- You have mirrored the Red Hat AI Inference Server image and OCI model images to your mirror registry.
- You have installed the Node Feature Discovery Operator and NVIDIA GPU Operator in the disconnected cluster.
Procedure
Create a namespace for the AI Inference Server deployment:
$ oc create namespace rhaiis-namespace

Create the Deployment CR using an init container to load the model from the mirrored OCI image:

- <MIRROR_REGISTRY_URL>: Replace with the URL of your mirror registry. The init container copies model files from the OCI image to a shared volume before the inference server starts.
- mountPath: /dev/shm: Mounts the shared memory volume required by the NVIDIA Collective Communications Library (NCCL). Tensor parallel deployments fail without this volume mount.
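The Deployment manifest itself is not reproduced here; the following is a minimal sketch of what such a manifest might look like. The deployment name (granite), image paths, model mount path, and shared-memory size are assumptions; adjust them to match your mirrored images and model:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: granite
  namespace: rhaiis-namespace
  labels:
    app: granite
spec:
  replicas: 1
  selector:
    matchLabels:
      app: granite
  template:
    metadata:
      labels:
        app: granite
    spec:
      initContainers:
        # Copies model files from the mirrored OCI model image into the
        # shared volume before the inference server starts.
        - name: fetch-model
          image: <MIRROR_REGISTRY_URL>/rhelai1/granite-3-1-8b-instruct:latest  # hypothetical model image path
          command: ["sh", "-c", "cp -R /models/. /model-volume/"]
          volumeMounts:
            - name: model-volume
              mountPath: /model-volume
      containers:
        - name: rhaiis
          image: <MIRROR_REGISTRY_URL>/rhaiis/vllm-cuda-rhel9:latest  # hypothetical server image path
          args: ["--model", "/model-volume"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-volume
              mountPath: /model-volume
            # Shared memory required by NCCL; tensor parallel
            # deployments fail without this mount.
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-volume
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
```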
Create a Service CR for the model inference:

Optional: Create a Route CR to enable access to the model from outside the cluster:
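The Service and Route manifests are not reproduced here; a minimal sketch follows, assuming the Deployment is labeled app: granite and the server listens on port 8000 (the route name granite matches the verification command below):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: granite
  namespace: rhaiis-namespace
spec:
  selector:
    app: granite
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: granite
  namespace: rhaiis-namespace
spec:
  to:
    kind: Service
    name: granite
  port:
    targetPort: 8000
```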
Verification
Get the URL for the exposed route:
$ oc get route granite -n rhaiis-namespace -o jsonpath='{.spec.host}'

Example output
granite-rhaiis-namespace.apps.example.com
Query the model to verify the deployment:
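The query command is not reproduced above. Assuming the server exposes the OpenAI-compatible API that vLLM-based servers provide, a verification request might look like the following; the route host, model value, and prompt are illustrative:

```shell
$ curl -X POST https://granite-rhaiis-namespace.apps.example.com/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "/model-volume",
      "prompt": "What is the capital of France?",
      "max_tokens": 50
    }'
```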
The model returns an answer in a valid JSON response.