Inference serving language models in OCI-compliant model containers
Inferencing OCI-compliant models in Red Hat AI Inference Server
Chapter 1. About OCI-compliant model containers
You can inference serve OCI-compliant models in Red Hat AI Inference Server. Storing models in OCI-compliant model containers (or modelcars) is an alternative to S3 or URI-based storage for language models. OCI model images let you distribute models through container registries by using the same versioning, caching, security, and distribution infrastructure that you already have for containers.
Using modelcar containers provides faster startup times by avoiding repeated downloads, lower disk usage, and better performance with pre-fetched images. Modelcar containers can be stored in standard container registries alongside application containers, enabling unified model versioning and distribution workflows.
Before you can deploy a language model in a modelcar container in the cluster, you need to package the model in an OCI container image and then deploy the container image in the cluster.
Chapter 2. Creating a modelcar image and pushing it to a container image registry
You can create a modelcar image that contains a language model that you can deploy with Red Hat AI Inference Server.
To create a modelcar image, download the model from Hugging Face and then package it into a container image and push the modelcar container to an image registry.
Prerequisites
- You have installed Python 3.11 or later.
- You have installed Podman or Docker.
- You have access to the internet to download models from Hugging Face.
- You have configured a container image registry that you can push images to and have logged in.
Procedure
Create a Python virtual environment and install the `huggingface_hub` Python library:

```
python3 -m venv venv && \
source venv/bin/activate && \
pip install --upgrade pip && \
pip install huggingface_hub
```

Create a model downloader Python script:

```
vi download_model.py
```

Add the following content to the `download_model.py` file, adjusting the value for `model_repo` as required:

```python
from huggingface_hub import snapshot_download

# Specify the Hugging Face repository containing the model
model_repo = "ibm-granite/granite-3.1-2b-instruct"

snapshot_download(
    repo_id=model_repo,
    local_dir="/models",
    allow_patterns=["*.safetensors", "*.json", "*.txt"],
)
```

Create a `Dockerfile` for the modelcar:

```dockerfile
FROM registry.access.redhat.com/ubi9/python-311:latest as base

USER root
RUN pip install huggingface-hub

# Download the model file from Hugging Face
COPY download_model.py .
RUN python download_model.py

# Final image containing only the essential model files
FROM registry.access.redhat.com/ubi9/ubi-micro:9.4

# Copy the model files from the base container
COPY --from=base /models /models

USER 1001
```

Build the modelcar image:

```
podman build . -t modelcar-example:latest --platform linux/amd64
```

Example output:

```
Successfully tagged localhost/modelcar-example:latest
```

Push the modelcar image to the container registry. For example:

```
$ podman push modelcar-example:latest quay.io/<your_model_registry>/modelcar-example:latest
```

Example output:

```
Getting image source signatures
Copying blob b2ed7134f853 done
Copying config 4afd393610 done
Writing manifest to image destination
Storing signatures
```
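The `allow_patterns` argument in the download script above controls which repository files end up in the modelcar image. As a minimal sketch of how that glob filtering behaves, the same matching can be reproduced with Python's standard `fnmatch` module (the file listing below is hypothetical, for illustration only):

```python
from fnmatch import fnmatch

# Patterns used in the download script
allow_patterns = ["*.safetensors", "*.json", "*.txt"]

# Hypothetical repository file listing, for illustration
repo_files = [
    "model-00001-of-00002.safetensors",
    "config.json",
    "tokenizer.json",
    "vocab.txt",
    "README.md",
    "pytorch_model.bin",
]

# Keep only files matching at least one allowed pattern,
# which is how the snapshot download decides what to fetch
downloaded = [
    f for f in repo_files
    if any(fnmatch(f, pat) for pat in allow_patterns)
]
print(downloaded)
```

Files such as original framework checkpoints (`.bin`) that the server does not need are excluded, which keeps the final `ubi-micro` image layer small.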
Chapter 3. Inference serving modelcar container images with AI Inference Server and Podman
Serve a large language model stored in a modelcar container with Podman and Red Hat AI Inference Server running on NVIDIA CUDA AI accelerators. Modelcar containers provide an OCI-compliant method for packaging and distributing language models as container images.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to `registry.redhat.io` and have logged in.
- You have created a modelcar container image containing the language model you want to serve and pushed it to a container image registry that you have access to.
- You have access to a Linux server with data center grade NVIDIA AI accelerators installed.

  For NVIDIA GPUs:

  - Install NVIDIA drivers.
  - Install the NVIDIA Container Toolkit.
  - If your system has multiple NVIDIA GPUs that use NVSwitch, you must have root access to start Fabric Manager.
For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.
Procedure
Open a terminal on your server host, and log in to `registry.redhat.io`:

```
$ podman login registry.redhat.io
```

Optional: Log in to the container registry where your modelcar container image is stored. For example:

```
$ podman login quay.io
```

Pull the relevant NVIDIA CUDA image by running the following command:

```
$ podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0
```

If your system has SELinux enabled, configure SELinux to allow device access:

```
$ sudo setsebool -P container_use_devices 1
```

Create a folder that you will later mount as a volume in the container, and adjust the folder permissions so that the container can use it:

```
$ mkdir -p rhaiis-cache
$ chmod g+rwX rhaiis-cache
```

Start the AI Inference Server container image. Run the following commands:
For NVIDIA CUDA accelerators, if the host system has multiple GPUs and uses NVSwitch, start NVIDIA Fabric Manager. To detect whether your system uses NVSwitch, first check if files are present in `/proc/driver/nvidia-nvswitch/devices/`, and then start NVIDIA Fabric Manager. Starting NVIDIA Fabric Manager requires root privileges.

```
$ ls /proc/driver/nvidia-nvswitch/devices/
```

Example output:

```
0000:0c:09.0  0000:0c:0a.0  0000:0c:0b.0  0000:0c:0c.0  0000:0c:0d.0  0000:0c:0e.0
```

```
$ systemctl start nvidia-fabricmanager
```

Important: NVIDIA Fabric Manager is only required on systems with multiple GPUs that use NVSwitch. For more information, see NVIDIA Server Architectures.
Check that the Red Hat AI Inference Server container can access NVIDIA GPUs on the host by running the following command:

```
$ podman run --rm -it \
    --security-opt=label=disable \
    --device nvidia.com/gpu=all \
    nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
    nvidia-smi
```

Example output:

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:08:01.0 Off |                    0 |
| N/A   32C    P0             64W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off |   00000000:08:02.0 Off |                    0 |
| N/A   29C    P0             63W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

Start the AI Inference Server container with the modelcar container image mounted:
```
$ podman run --rm -it \
    --device nvidia.com/gpu=all \
    --security-opt=label=disable \
    --shm-size=4g \
    --userns=keep-id:uid=1001 \
    -p 8000:8000 \
    -e HF_HUB_OFFLINE=1 \
    -e TRANSFORMERS_OFFLINE=1 \
    --mount type=image,source=registry.redhat.io/rhelai1/modelcar-granite-8b-code-instruct:1.4-1739210683,destination=/model \
    -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
    registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
    --model /model/models \
    --port 8000 \
    --dtype auto \
    --max-model-len 4096 \
    --tensor-parallel-size 2 \
    --served-model-name rhelai1/modelcar-granite-8b-code-instruct
```

Where:

- `--security-opt=label=disable`: Disables SELinux label relabeling for volume mounts. Required for systems where SELinux is enabled. Without this option, the container might fail to start.
- `--shm-size=4g`: Specifies the shared memory size. Increase to `8GB` if you experience shared memory issues.
- `--userns=keep-id:uid=1001`: Maps the host UID to the effective UID of the vLLM process in the container. Alternatively, you can pass `--user=0`, but this is less secure because it runs vLLM as root inside the container.
- `-e HF_HUB_OFFLINE=1`: Prevents Hugging Face Hub from connecting to the internet.
- `-e TRANSFORMERS_OFFLINE=1`: Configures the Transformers library to use only the locally mounted model.
- `--mount type=image,source=registry.redhat.io/rhelai1/modelcar-granite-8b-code-instruct:1.4-1739210683,destination=/model`: Mounts the modelcar container directly inside the running `rhaiis/vllm-cuda-rhel9` Red Hat AI Inference Server container.
- `-v ./rhaiis-cache:/opt/app-root/src/.cache:Z`: Mounts the cache directory with the SELinux context. The `:Z` suffix is required for systems where SELinux is enabled. On Debian, Ubuntu, or Docker without SELinux, omit the `:Z` suffix.
- `--model /model/models`: Specifies the path to the model directory inside the container.
- `--tensor-parallel-size 2`: Specifies the number of GPUs to use for tensor parallelism. Set this value to match the number of available GPUs.
- `--served-model-name rhelai1/modelcar-granite-8b-code-instruct`: Specifies a user-friendly name for the served model. If not set, the name defaults to the value of the `--model` parameter.
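As a rough illustration of why `--tensor-parallel-size` matters: the model weights are sharded evenly across the participating GPUs, so each GPU only needs to hold its shard. The sketch below is a back-of-the-envelope estimate only; the parameter count and precision are illustrative assumptions, and weights are just part of the footprint, since the KV cache and activations need additional memory:

```python
def per_gpu_weight_gib(num_params_b: float, bytes_per_param: int, tp_size: int) -> float:
    """Estimate per-GPU weight memory in GiB when weights are
    sharded evenly across tp_size GPUs (weights only, no KV cache)."""
    total_gib = num_params_b * 1e9 * bytes_per_param / 2**30
    return total_gib / tp_size

# Hypothetical: an 8B-parameter model in 16-bit precision on 2 GPUs
estimate = per_gpu_weight_gib(8, 2, 2)
print(round(estimate, 1))
```

Doubling the tensor parallel size halves the weight memory each GPU must hold, which is why the value is set to match the number of available GPUs.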
Verification
In a separate tab in your terminal, make a request to the model with the API:

```
$ curl -X POST -H "Content-Type: application/json" -d '{
    "prompt": "What is the capital city of Ireland?",
    "model": "rhelai1/modelcar-granite-8b-code-instruct",
    "max_tokens": 50
}' http://localhost:8000/v1/completions | jq
```

Example output:

```json
{
  "id": "cmpl-0c3d8d1ac21642529c073453bbb34b01",
  "object": "text_completion",
  "created": 1760362458,
  "model": "rhelai1/modelcar-granite-8b-code-instruct",
  "choices": [
    {
      "index": 0,
      "text": "\nAnswer: Dublin",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 17,
    "completion_tokens": 7,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
```
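In a script, the same response can be consumed with Python's standard `json` module instead of `jq`. A minimal sketch using a trimmed copy of the example output above (the field names match what the OpenAI-compatible endpoint returns in that example):

```python
import json

# Trimmed sample of the example /v1/completions response above
raw = """
{
  "model": "rhelai1/modelcar-granite-8b-code-instruct",
  "choices": [
    {"index": 0, "text": "\\nAnswer: Dublin", "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 10, "total_tokens": 17, "completion_tokens": 7}
}
"""
response = json.loads(raw)

# Extract the generated text and token accounting
text = response["choices"][0]["text"].strip()
total = response["usage"]["total_tokens"]
print(text, total)
```

Checking `finish_reason` is also useful in practice: `"stop"` means the model completed naturally rather than hitting the `max_tokens` limit.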
Chapter 4. Inference serving modelcar images with AI Inference Server in OpenShift Container Platform
Deploy a language model in a modelcar container with OpenShift Container Platform by configuring secrets, persistent storage, and a deployment custom resource (CR) that uses Red Hat AI Inference Server to inference serve the modelcar container image.
Prerequisites
- You have installed the OpenShift CLI (`oc`).
- You have logged in as a user with `cluster-admin` privileges.
- You have installed NFD and the required GPU Operator for your underlying AI accelerator hardware.
- You have created a modelcar container image for the language model and pushed it to a container image registry.
Procedure
Create the Docker secret so that the cluster can download the Red Hat AI Inference Server image from the container registry. For example, to create a `Secret` CR that contains the contents of your local `~/.docker/config.json` file, run the following command:

```
$ oc create secret generic docker-secret --from-file=.dockercfg=$HOME/.docker/config.json --type=kubernetes.io/dockercfg -n rhaiis-namespace
```

Create a `PersistentVolumeClaim` (PVC) custom resource (CR) and apply it in the cluster. The following example `PVC` CR uses a default IBM VPC Block persistent volume.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: rhaiis-namespace
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: ibmc-vpc-block-10iops-tier
```

Note: Configuring cluster storage to meet your requirements is outside the scope of this procedure. For more detailed information, see Configuring persistent storage.
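When choosing the PVC `storage` request, a rough rule of thumb is the size of the model weights plus headroom for metadata and future updates. The following is an illustrative calculation only; the parameter count, bytes per parameter, and headroom factor are all assumptions, not values from this procedure:

```python
def pvc_size_gib(num_params_b: float, bytes_per_param: int = 2, headroom: float = 1.5) -> float:
    """Rough PVC sizing: weight bytes in GiB times a headroom factor.
    Assumes 16-bit weights and 50% headroom by default (illustrative)."""
    weights_gib = num_params_b * 1e9 * bytes_per_param / 2**30
    return weights_gib * headroom

# Hypothetical: a 2B-parameter model in 16-bit precision
print(round(pvc_size_gib(2), 1))
```

A 20Gi claim, as in the example PVC, comfortably covers small models of this class and leaves room to cache a larger replacement later.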
Create a `Deployment` custom resource (CR) that pulls the modelcar image and deploys the Red Hat AI Inference Server container. Reference the following example `Deployment` CR, which uses AI Inference Server to serve a modelcar image.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rhaiis-oci-deploy
  namespace: rhaiis-namespace
  labels:
    app: granite
spec:
  replicas: 0
  selector:
    matchLabels:
      app: rhaiis-oci-deploy
  template:
    metadata:
      labels:
        app: rhaiis-oci-deploy
    spec:
      imagePullSecrets:
        - name: docker-secret
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: model-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "2Gi"
        - name: oci-auth
          secret:
            secretName: docker-secret
            items:
              - key: .dockercfg
                path: config.json
      initContainers:
        - name: fetch-model
          image: ghcr.io/oras-project/oras:v1.2.0
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -e
              # Only pull if /model is empty
              if [ -z "$(ls -A /model)" ]; then
                echo "Pulling model…"
                # Update with the modelcar container image registry URL
                oras pull <YOUR_MODELCAR_REGISTRY_URL> \
                  --output /model
              else
                echo "Model already present, skipping pull"
              fi
          volumeMounts:
            - name: model-volume
              mountPath: /model
            - name: oci-auth
              mountPath: /auth
              readOnly: true
      containers:
        - name: granite
          image: 'registry.redhat.io/rhaiis/vllm-cuda-rhel9@sha256:a6645a8e8d7928dce59542c362caf11eca94bb1b427390e78f0f8a87912041cd'
          imagePullPolicy: IfNotPresent
          env:
            - name: VLLM_SERVER_DEV_MODE
              value: '1'
          command:
            - python
            - '-m'
            - vllm.entrypoints.openai.api_server
          args:
            - '--port=8000'
            - '--model=/model'
            - '--served-model-name=ibm-granite/granite-3.1-2b-instruct'
            - '--tensor-parallel-size=1'
          resources:
            limits:
              cpu: '10'
              nvidia.com/gpu: '1'
              memory: 16Gi
            requests:
              cpu: '2'
              memory: 6Gi
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: model-volume
              mountPath: /model
            - name: shm
              mountPath: /dev/shm
      restartPolicy: Always
```

Where:
- `claimName: model-cache`: Specifies the persistent volume claim name. The value of `spec.template.spec.volumes.persistentVolumeClaim.claimName` must match the name of the `PVC` that you created.
- `initContainers:`: Defines a container that runs before the main application container to download the required modelcar image. The model pull step is skipped if the model directory has already been populated, for example, from a previous deployment.
- `--served-model-name=ibm-granite/granite-3.1-2b-instruct`: Specifies a user-friendly name for the served model. Update this value to match the model that you are deploying.
- `mountPath: /dev/shm`: Mounts the shared memory volume required by the NVIDIA Collective Communications Library (NCCL). Tensor parallel vLLM deployments fail without this volume mount.
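The init container's pull-if-empty guard is what makes the model download idempotent across pod restarts: the `ls -A` check skips the pull whenever the PVC already holds files. The same check can be sketched in Python (the paths and demo are illustrative only):

```python
import os
import tempfile

def needs_pull(model_dir: str) -> bool:
    """Return True when the model directory is empty,
    mirroring the init container's `ls -A` check."""
    return len(os.listdir(model_dir)) == 0

# Illustrative demo against a temporary directory
with tempfile.TemporaryDirectory() as d:
    first = needs_pull(d)                              # empty, so a pull is needed
    open(os.path.join(d, "config.json"), "w").close()  # simulate a cached model file
    second = needs_pull(d)                             # populated, so the pull is skipped
print(first, second)
```

Because the check keys off the persistent volume rather than the pod, a rescheduled pod reuses the cached model instead of pulling it again.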
Increase the deployment replica count to the required number. For example, run the following command:

```
$ oc scale deployment rhaiis-oci-deploy -n rhaiis-namespace --replicas=1
```

Optional: Watch the deployment and ensure that it succeeds:

```
$ oc get deployment -n rhaiis-namespace --watch
```

Example output:

```
NAME                READY   UP-TO-DATE   AVAILABLE   AGE
rhaiis-oci-deploy   0/1     1            0           2s
rhaiis-oci-deploy   1/1     1            1           14s
```

Create a `Service` CR for the model inference. For example:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: rhaiis-oci-deploy
  namespace: rhaiis-namespace
spec:
  selector:
    app: rhaiis-oci-deploy
  ports:
    - name: http
      port: 80
      targetPort: 8000
```

Optional: Create a `Route` CR to enable public access to the model. For example:

```yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: rhaiis-oci-deploy
  namespace: rhaiis-namespace
spec:
  to:
    kind: Service
    name: rhaiis-oci-deploy
  port:
    targetPort: http
```

Get the URL for the exposed route. Run the following command:

```
$ oc get route rhaiis-oci-deploy -n rhaiis-namespace -o jsonpath='{.spec.host}'
```

Example output:

```
rhaiis-oci-deploy-rhaiis-namespace.apps.example.com
```
Verification
Ensure that the deployment is successful by querying the model. Run the following command:
```
$ curl -v -k http://rhaiis-oci-deploy-rhaiis-namespace.apps.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ibm-granite/granite-3.1-2b-instruct",
    "messages": [{"role": "user", "content": "Hello?"}],
    "temperature": 0.1
}' | jq
```

Replace the host with the route URL that you retrieved in the previous step.
Example output
```json
{
  "id": "chatcmpl-07b177360eaa40a3b311c24a8e3c7f43",
  "object": "chat.completion",
  "created": 1755189746,
  "model": "ibm-granite/granite-3.1-2b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "Hello! How can I assist you today?",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 61,
    "total_tokens": 71,
    "completion_tokens": 10,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}
```