Inference serving language models in OCI-compliant model containers
Inference serving OCI-compliant models with Red Hat AI Inference Server
Abstract
Chapter 1. About OCI-compliant model containers
You can inference serve OCI-compliant models in Red Hat AI Inference Server. Storing models in OCI-compliant model containers (or modelcars) is an alternative to S3 or URI-based storage for language models. OCI model images let you distribute models through container registries by using the same versioning, caching, security, and distribution infrastructure that you already have for containers.
Using modelcar containers allows for faster startup times by avoiding repeated downloads, lower disk usage, and better performance with pre-fetched images. Modelcar containers can be stored in standard container registries alongside application containers, enabling unified model versioning and distribution workflows.
Before you can deploy a language model in a modelcar container in the cluster, you need to package the model in an OCI container image and then deploy the container image in the cluster.
Chapter 2. Creating a modelcar image and pushing it to a container image registry
You can create a modelcar image that contains a language model that you can deploy with Red Hat AI Inference Server.
To create a modelcar image, download the model from Hugging Face and then package it into a container image and push the modelcar container to an image registry.
Prerequisites
- You have installed Python 3.11 or later.
- You have installed Podman or Docker.
- You have access to the internet to download models from Hugging Face.
- You have configured a container image registry that you can push images to and have logged in.
Procedure
Create a Python virtual environment and install the huggingface_hub Python library:

$ python3 -m venv venv && \
  source venv/bin/activate && \
  pip install --upgrade pip && \
  pip install huggingface_hub

Create a model downloader Python script:
$ vi download_model.py

Add the following content to the download_model.py file, adjusting the value for model_repo as required:
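The script body is not preserved in this extract. A minimal sketch that uses the huggingface_hub snapshot_download function follows; the model_repo value and the local models directory are assumptions that you should adjust for your deployment:

```python
from pathlib import Path

from huggingface_hub import snapshot_download

# Model repository to download; this value is an assumed example,
# adjust it as required
model_repo = "ibm-granite/granite-3.1-2b-instruct"

# Download the model snapshot into a local "models" directory so that
# the Dockerfile can copy it into the modelcar image
snapshot_download(
    repo_id=model_repo,
    local_dir=Path("models"),
)
```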
Create a Dockerfile for the modelcar:
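The Dockerfile content is not preserved in this extract. A common modelcar pattern copies the downloaded model into a minimal base image under /models, which matches the /model/models path used when the modelcar is mounted at serve time; the base image below is an assumption:

```dockerfile
# Minimal base image for the modelcar; this choice is an assumption,
# any small OCI base image works
FROM registry.access.redhat.com/ubi9/ubi-micro:latest

# Copy the downloaded model files into the image at /models
COPY --chown=1001:0 models /models

USER 1001
```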
Build the modelcar image:

$ podman build . -t modelcar-example:latest --platform linux/amd64

Example output
Successfully tagged localhost/modelcar-example:latest
Push the modelcar image to the container registry. For example:
$ podman push modelcar-example:latest quay.io/<your_model_registry>/modelcar-example:latest

Example output
Getting image source signatures
Copying blob b2ed7134f853 done
Copying config 4afd393610 done
Writing manifest to image destination
Storing signatures
Chapter 3. Inference serving modelcar container images with AI Inference Server and Podman
Serve a large language model stored in a modelcar container with Podman and Red Hat AI Inference Server running on NVIDIA CUDA AI accelerators. Modelcar containers provide an OCI-compliant method for packaging and distributing language models as container images.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to registry.redhat.io and have logged in.
- You have created a modelcar container image containing the language model you want to serve and pushed it to a container image registry that you have access to.
- You have access to a Linux server with data center grade NVIDIA AI accelerators installed.
For NVIDIA GPUs:
- Install NVIDIA drivers
- Install the NVIDIA Container Toolkit
- If your system has multiple NVIDIA GPUs that use NVSwitch, you must have root access to start Fabric Manager
For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.
Procedure
Open a terminal on your server host, and log in to registry.redhat.io:

$ podman login registry.redhat.io

Optional: Log in to the container registry where your modelcar container image is stored. For example:
$ podman login quay.io

Pull the relevant NVIDIA CUDA image by running the following command:
$ podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0

If your system has SELinux enabled, configure SELinux to allow device access:
$ sudo setsebool -P container_use_devices 1

Create a folder that you will later mount as a volume in the container. Adjust the folder permissions so that the container can use it.
$ mkdir -p rhaiis-cache
$ chmod g+rwX rhaiis-cache

Start the AI Inference Server container image. Run the following commands:
For NVIDIA CUDA accelerators, if the host system has multiple GPUs and uses NVSwitch, start NVIDIA Fabric Manager. To detect whether your system uses NVSwitch, check if files are present in /proc/driver/nvidia-nvswitch/devices/, and then start NVIDIA Fabric Manager. Starting NVIDIA Fabric Manager requires root privileges.

$ ls /proc/driver/nvidia-nvswitch/devices/

Example output

0000:0c:09.0  0000:0c:0a.0  0000:0c:0b.0  0000:0c:0c.0  0000:0c:0d.0  0000:0c:0e.0

$ systemctl start nvidia-fabricmanager

Important: NVIDIA Fabric Manager is only required on systems with multiple GPUs that use NVSwitch. For more information, see NVIDIA Server Architectures.
Check that the Red Hat AI Inference Server container can access NVIDIA GPUs on the host by running the following command:
$ podman run --rm -it \
    --security-opt=label=disable \
    --device nvidia.com/gpu=all \
    nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
    nvidia-smi

Example output

Start the AI Inference Server container with the modelcar container image mounted:
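The full serve command is not preserved in this extract. The following sketch assembles it from the options documented with this step; the -p 8000:8000 port mapping is an assumption based on the verification request to localhost:8000, and the image references match the examples used elsewhere in this procedure:

```shell
podman run --rm -it \
  --device nvidia.com/gpu=all \
  --security-opt=label=disable \
  --shm-size=4g \
  --userns=keep-id:uid=1001 \
  -p 8000:8000 \
  -e HF_HUB_OFFLINE=1 \
  -e TRANSFORMERS_OFFLINE=1 \
  --mount type=image,source=registry.redhat.io/rhelai1/modelcar-granite-8b-code-instruct:1.4-1739210683,destination=/model \
  -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
  --model /model/models \
  --tensor-parallel-size 2 \
  --served-model-name rhelai1/modelcar-granite-8b-code-instruct
```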
Where:
--security-opt=label=disable
  Disables SELinux label relabeling for volume mounts. Required for systems where SELinux is enabled. Without this option, the container might fail to start.
--shm-size=4g
  Specifies the shared memory size. Increase to 8GB if you experience shared memory issues.
--userns=keep-id:uid=1001
  Maps the host UID to the effective UID of the vLLM process in the container. Alternatively, you can pass --user=0, but this is less secure because it runs vLLM as root inside the container.
-e HF_HUB_OFFLINE=1
  Prevents Hugging Face Hub from connecting to the internet.
-e TRANSFORMERS_OFFLINE=1
  Configures the Transformers library to use only the locally mounted model.
--mount type=image,source=registry.redhat.io/rhelai1/modelcar-granite-8b-code-instruct:1.4-1739210683,destination=/model
  Mounts the modelcar container directly inside the running rhaiis/vllm-cuda-rhel9 Red Hat AI Inference Server container.
-v ./rhaiis-cache:/opt/app-root/src/.cache:Z
  Mounts the cache directory with the SELinux context. The :Z suffix is required for systems where SELinux is enabled. On Debian, Ubuntu, or Docker without SELinux, omit the :Z suffix.
--model /model/models
  Specifies the path to the model directory inside the container.
--tensor-parallel-size 2
  Specifies the number of GPUs to use for tensor parallelism. Set this value to match the number of available GPUs.
--served-model-name rhelai1/modelcar-granite-8b-code-instruct
  Specifies a user-friendly name for the served model. If not set, the name defaults to the value of the --model parameter.
Verification
In a separate tab in your terminal, make a request to the model with the API:

$ curl -X POST -H "Content-Type: application/json" \
    -d '{
      "prompt": "What is the capital city of Ireland?",
      "model": "rhelai1/modelcar-granite-8b-code-instruct",
      "max_tokens": 50
    }' http://localhost:8000/v1/completions | jq

Example output
Chapter 4. Inference serving modelcar images with AI Inference Server in OpenShift Container Platform
Deploy a language model in a modelcar container with OpenShift Container Platform by configuring secrets, persistent storage, and a deployment custom resource (CR) that uses Red Hat AI Inference Server to inference serve the modelcar container image.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- You have installed NFD and the required GPU Operator for your underlying AI accelerator hardware.
- You have created a modelcar container image for the language model and pushed it to a container image registry.
Procedure
Create the Docker secret so that the cluster can download the Red Hat AI Inference Server image from the container registry. For example, to create a Secret CR that contains the contents of your local ~/.docker/config.json file, run the following command:

$ oc create secret generic docker-secret --from-file=.dockercfg=$HOME/.docker/config.json --type=kubernetes.io/dockercfg -n rhaiis-namespace

Create a PersistentVolumeClaim (PVC) custom resource (CR) and apply it in the cluster. The following example PVC CR uses a default IBM VPC Block persistent volume.

Note: Configuring cluster storage to meet your requirements is outside the scope of this procedure. For more detailed information, see Configuring persistent storage.
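The example PVC CR is not preserved in this extract. A minimal sketch follows; the storage class name and requested size are assumptions, so substitute the storage class available in your cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: rhaiis-namespace
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      # Assumed size; size the claim for your model
      storage: 100Gi
  # Assumed IBM VPC Block storage class name; use the class
  # available in your cluster
  storageClassName: ibmc-vpc-block-10iops-tier
```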
Create a Deployment custom resource (CR) that pulls the modelcar image and deploys the Red Hat AI Inference Server container. Reference the following example Deployment CR, which uses AI Inference Server to serve a modelcar image.
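The example Deployment CR is not preserved in this extract. The following sketch is grounded in the option descriptions in this step; the container names, mount paths, modelcar image reference, GPU count, and init container copy logic are assumptions to adapt for your environment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: granite
  namespace: rhaiis-namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: granite
  template:
    metadata:
      labels:
        app: granite
    spec:
      imagePullSecrets:
        - name: docker-secret
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 4Gi
      initContainers:
        # Runs before the main container to copy the model out of the
        # modelcar image into the PVC; replace the image reference with
        # your own modelcar image
        - name: fetch-model
          image: quay.io/<your_model_registry>/modelcar-example:latest
          command: ["sh", "-c"]
          args:
            - |
              # Skip the pull step if the model directory is already
              # populated, for example, from a previous deployment
              if [ -z "$(ls -A /mnt/models 2>/dev/null)" ]; then
                cp -R /models/. /mnt/models/
              fi
          volumeMounts:
            - name: model-storage
              mountPath: /mnt/models
      containers:
        - name: rhaiis
          image: registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0
          args:
            - --model=/mnt/models
            - --served-model-name=ibm-granite/granite-3.1-2b-instruct
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-storage
              mountPath: /mnt/models
            # Shared memory volume required by NCCL for tensor
            # parallel deployments
            - name: shm
              mountPath: /dev/shm
```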
Where:

claimName: model-cache
  Specifies the persistent volume claim name. The value of spec.template.spec.volumes.persistentVolumeClaim.claimName must match the name of the PVC that you created.
initContainers:
  Defines a container that runs before the main application container to download the required modelcar image. The model pull step is skipped if the model directory has already been populated, for example, from a previous deployment.
--served-model-name=ibm-granite/granite-3.1-2b-instruct
  Specifies a user-friendly name for the served model. Update this value to match the model that you are deploying.
mountPath: /dev/shm
  Mounts the shared memory volume required by the NVIDIA Collective Communications Library (NCCL). Tensor parallel vLLM deployments fail without this volume mount.
Increase the deployment replica count to the required number. For example, run the following command:
$ oc scale deployment granite -n rhaiis-namespace --replicas=1

Optional: Watch the deployment and ensure that it succeeds:
$ oc get deployment -n rhaiis-namespace --watch

Example output
NAME                READY   UP-TO-DATE   AVAILABLE   AGE
rhaiis-oci-deploy   0/1     1            0           2s
rhaiis-oci-deploy   1/1     1            1           14s

Create a Service CR for the model inference. For example:
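The example Service CR is not preserved in this extract. A minimal sketch follows; the service name and selector labels are assumptions that must match your Deployment:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: granite
  namespace: rhaiis-namespace
spec:
  # Assumed label selector; must match the Deployment pod labels
  selector:
    app: granite
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
```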
Optional: Create a Route CR to enable public access to the model. For example:
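The example Route CR is not preserved in this extract. A minimal sketch follows; the route and service names are assumptions that must match the Service you created:

```yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: granite
  namespace: rhaiis-namespace
spec:
  to:
    kind: Service
    name: granite
  port:
    targetPort: 8000
```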
Get the URL for the exposed route. Run the following command:

$ oc get route granite -n rhaiis-namespace -o jsonpath='{.spec.host}'

Example output
rhaiis-oci-deploy-rhaiis-namespace.apps.example.com
Verification
Ensure that the deployment is successful by querying the model. Run the following command:
$ curl -v -k http://rhaiis-oci-deploy-rhaiis-namespace.apps.modelsibm.ibmmodel.rh-ods.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model":"ibm-granite/granite-3.1-2b-instruct",
      "messages":[{"role":"user","content":"Hello?"}],
      "temperature":0.1
    }' | jq
Example output