Inference serving language models in OCI-compliant model containers
Inferencing OCI-compliant models in Red Hat AI Inference Server
Chapter 1. About OCI-compliant model containers
You can inference serve OCI-compliant models in Red Hat AI Inference Server. Storing models in OCI-compliant model containers (or modelcars) is an alternative to S3 or URI-based storage for language models. OCI model images let you distribute models through container registries by using the same versioning, caching, security, and distribution infrastructure that you already have for containers.
Using modelcar containers provides faster startup times by avoiding repeated downloads, lower disk usage, and better performance with pre-fetched images. Modelcar containers can be stored in standard container registries alongside application containers, enabling unified model versioning and distribution workflows.
Before you can deploy a language model in a modelcar container in the cluster, you need to package the model in an OCI container image and then deploy the container image in the cluster.
Chapter 2. Creating a modelcar image and pushing it to a container image registry
You can create a modelcar image that contains a language model that you can deploy with Red Hat AI Inference Server.
To create a modelcar image, download the model from Hugging Face and then package it into a container image and push the modelcar container to an image registry.
Prerequisites
- You have installed Python 3.11 or later.
- You have installed Podman or Docker.
- You have access to the internet to download models from Hugging Face.
- You have configured a container image registry that you can push images to and have logged in.
Procedure
Create a Python virtual environment and install the `huggingface_hub` Python library:

```
python3 -m venv venv && \
source venv/bin/activate && \
pip install --upgrade pip && \
pip install huggingface_hub
```

Create a model downloader Python script:

```
vi download_model.py
```

Add the following content to the `download_model.py` file, adjusting the value for `model_repo` as required:

```python
from huggingface_hub import snapshot_download

# Specify the Hugging Face repository containing the model
model_repo = "ibm-granite/granite-3.1-2b-instruct"

snapshot_download(
    repo_id=model_repo,
    local_dir="/models",
    allow_patterns=["*.safetensors", "*.json", "*.txt"],
)
```

Create a `Dockerfile` for the modelcar:

```dockerfile
FROM registry.access.redhat.com/ubi9/python-311:latest as base

USER root
RUN pip install huggingface-hub

# Download the model file from Hugging Face
COPY download_model.py .
RUN python download_model.py

# Final image containing only the essential model files
FROM registry.access.redhat.com/ubi9/ubi-micro:9.4

# Copy the model files from the base container
COPY --from=base /models /models

USER 1001
```

Build the modelcar image:

```
podman build . -t modelcar-example:latest --platform linux/amd64
```

Example output:

```
Successfully tagged localhost/modelcar-example:latest
```

Push the modelcar image to the container registry. For example:

```
$ podman push modelcar-example:latest quay.io/<your_model_registry>/modelcar-example:latest
```

Example output:

```
Getting image source signatures
Copying blob b2ed7134f853 done
Copying config 4afd393610 done
Writing manifest to image destination
Storing signatures
```
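The `allow_patterns` argument in the download script above controls which repository files end up in the modelcar image. As a minimal sketch of how that glob filtering behaves, the same matching can be reproduced with Python's standard `fnmatch` module (the file listing below is hypothetical, for illustration only):

```python
from fnmatch import fnmatch

# Patterns used in the download script
allow_patterns = ["*.safetensors", "*.json", "*.txt"]

# Hypothetical repository file listing, for illustration
repo_files = [
    "model-00001-of-00002.safetensors",
    "config.json",
    "tokenizer.json",
    "vocab.txt",
    "README.md",
    "pytorch_model.bin",
]

# Keep only files matching at least one allowed pattern,
# which is how the snapshot download decides what to fetch
downloaded = [
    f for f in repo_files
    if any(fnmatch(f, pat) for pat in allow_patterns)
]
print(downloaded)
```

Files such as original framework checkpoints (`.bin`) that the server does not need are excluded, which keeps the final `ubi-micro` image layer small.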
Chapter 3. Inference serving modelcar container images with AI Inference Server and Podman
Serve a large language model stored in a modelcar container with Podman and Red Hat AI Inference Server running on NVIDIA CUDA AI accelerators. Modelcar containers provide an OCI-compliant method for packaging and distributing language models as container images.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to `registry.redhat.io` and have logged in.
- You have created a modelcar container image containing the language model you want to serve and pushed it to a container image registry that you have access to.
- You have access to a Linux server with data center grade NVIDIA AI accelerators installed.

  For NVIDIA GPUs:

  - Install NVIDIA drivers.
  - Install the NVIDIA Container Toolkit.
  - If your system has multiple NVIDIA GPUs that use NVSwitch, you must have root access to start Fabric Manager.
For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.
Procedure
Open a terminal on your server host, and log in to `registry.redhat.io`:

```
$ podman login registry.redhat.io
```

Optional: Log in to the container registry where your modelcar container image is stored. For example:

```
$ podman login quay.io
```

Pull the relevant NVIDIA CUDA image by running the following command:

```
$ podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0
```

If your system has SELinux enabled, configure SELinux to allow device access:

```
$ sudo setsebool -P container_use_devices 1
```

Create a folder that you will later mount as a volume in the container, and adjust the folder permissions so that the container can use it:

```
$ mkdir -p rhaiis-cache
$ chmod g+rwX rhaiis-cache
```

Start the AI Inference Server container image. Run the following commands:
For NVIDIA CUDA accelerators, if the host system has multiple GPUs and uses NVSwitch, start NVIDIA Fabric Manager. To detect whether your system uses NVSwitch, first check if files are present in `/proc/driver/nvidia-nvswitch/devices/`, and then start NVIDIA Fabric Manager. Starting NVIDIA Fabric Manager requires root privileges.

```
$ ls /proc/driver/nvidia-nvswitch/devices/
```

Example output:

```
0000:0c:09.0  0000:0c:0a.0  0000:0c:0b.0  0000:0c:0c.0  0000:0c:0d.0  0000:0c:0e.0
```

```
$ systemctl start nvidia-fabricmanager
```

Important: NVIDIA Fabric Manager is only required on systems with multiple GPUs that use NVSwitch. For more information, see NVIDIA Server Architectures.
Check that the Red Hat AI Inference Server container can access NVIDIA GPUs on the host by running the following command:

```
$ podman run --rm -it \
    --security-opt=label=disable \
    --device nvidia.com/gpu=all \
    nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
    nvidia-smi
```

Example output:

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:08:01.0 Off |                    0 |
| N/A   32C    P0             64W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off |   00000000:08:02.0 Off |                    0 |
| N/A   29C    P0             63W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

Start the AI Inference Server container with the modelcar container image mounted:
```
$ podman run --rm -it \
    --device nvidia.com/gpu=all \
    --security-opt=label=disable \
    --shm-size=4g \
    --userns=keep-id:uid=1001 \
    -p 8000:8000 \
    -e HF_HUB_OFFLINE=1 \
    -e TRANSFORMERS_OFFLINE=1 \
    --mount type=image,source=registry.redhat.io/rhelai1/modelcar-granite-8b-code-instruct:1.4-1739210683,destination=/model \
    -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
    registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
    --model /model/models \
    --port 8000 \
    --dtype auto \
    --max-model-len 4096 \
    --tensor-parallel-size 2 \
    --served-model-name rhelai1/modelcar-granite-8b-code-instruct
```

Where:

- `--security-opt=label=disable`: Disables SELinux label relabeling for volume mounts. Required for systems where SELinux is enabled. Without this option, the container might fail to start.
- `--shm-size=4g`: Specifies the shared memory size. Increase to `8GB` if you experience shared memory issues.
- `--userns=keep-id:uid=1001`: Maps the host UID to the effective UID of the vLLM process in the container. Alternatively, you can pass `--user=0`, but this is less secure because it runs vLLM as root inside the container.
- `-e HF_HUB_OFFLINE=1`: Prevents Hugging Face Hub from connecting to the internet.
- `-e TRANSFORMERS_OFFLINE=1`: Configures the Transformers library to use only the locally mounted model.
- `--mount type=image,source=registry.redhat.io/rhelai1/modelcar-granite-8b-code-instruct:1.4-1739210683,destination=/model`: Mounts the modelcar container directly inside the running `rhaiis/vllm-cuda-rhel9` Red Hat AI Inference Server container.
- `-v ./rhaiis-cache:/opt/app-root/src/.cache:Z`: Mounts the cache directory with the SELinux context. The `:Z` suffix is required for systems where SELinux is enabled. On Debian, Ubuntu, or Docker without SELinux, omit the `:Z` suffix.
- `--model /model/models`: Specifies the path to the model directory inside the container.
- `--tensor-parallel-size 2`: Specifies the number of GPUs to use for tensor parallelism. Set this value to match the number of available GPUs.
- `--served-model-name rhelai1/modelcar-granite-8b-code-instruct`: Specifies a user-friendly name for the served model. If not set, the name defaults to the value of the `--model` parameter.
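As a rough illustration of why `--tensor-parallel-size` matters: the model weights are sharded evenly across the participating GPUs, so each GPU only needs to hold its shard. The sketch below is a back-of-the-envelope estimate only; the parameter count and precision are illustrative assumptions, and weights are just part of the footprint, since the KV cache and activations need additional memory:

```python
def per_gpu_weight_gib(num_params_b: float, bytes_per_param: int, tp_size: int) -> float:
    """Estimate per-GPU weight memory in GiB when weights are
    sharded evenly across tp_size GPUs (weights only, no KV cache)."""
    total_gib = num_params_b * 1e9 * bytes_per_param / 2**30
    return total_gib / tp_size

# Hypothetical: an 8B-parameter model in 16-bit precision on 2 GPUs
estimate = per_gpu_weight_gib(8, 2, 2)
print(round(estimate, 1))
```

Doubling the tensor parallel size halves the weight memory each GPU must hold, which is why the value is set to match the number of available GPUs.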
Verification
In a separate tab in your terminal, make a request to the model with the API:

```
$ curl -X POST -H "Content-Type: application/json" -d '{
    "prompt": "What is the capital city of Ireland?",
    "model": "rhelai1/modelcar-granite-8b-code-instruct",
    "max_tokens": 50
}' http://localhost:8000/v1/completions | jq
```

Example output:

```json
{
  "id": "cmpl-0c3d8d1ac21642529c073453bbb34b01",
  "object": "text_completion",
  "created": 1760362458,
  "model": "rhelai1/modelcar-granite-8b-code-instruct",
  "choices": [
    {
      "index": 0,
      "text": "\nAnswer: Dublin",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 17,
    "completion_tokens": 7,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
```
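In a script, the same response can be consumed with Python's standard `json` module instead of `jq`. A minimal sketch using a trimmed copy of the example output above (the field names match what the OpenAI-compatible endpoint returns in that example):

```python
import json

# Trimmed sample of the example /v1/completions response above
raw = """
{
  "model": "rhelai1/modelcar-granite-8b-code-instruct",
  "choices": [
    {"index": 0, "text": "\\nAnswer: Dublin", "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 10, "total_tokens": 17, "completion_tokens": 7}
}
"""
response = json.loads(raw)

# Extract the generated text and token accounting
text = response["choices"][0]["text"].strip()
total = response["usage"]["total_tokens"]
print(text, total)
```

Checking `finish_reason` is also useful in practice: `"stop"` means the model completed naturally rather than hitting the `max_tokens` limit.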
Chapter 4. Inference serving modelcar images with AI Inference Server in OpenShift Container Platform
Deploy a language model in a modelcar container with OpenShift Container Platform by configuring secrets, persistent storage, and a deployment custom resource (CR) that uses Red Hat AI Inference Server to inference serve the modelcar container image.
Prerequisites
- You have installed the OpenShift CLI (`oc`).
- You have logged in as a user with `cluster-admin` privileges.
- You have installed NFD and the required GPU Operator for your underlying AI accelerator hardware.
- You have created a modelcar container image for the language model and pushed it to a container image registry.
Procedure
Create the Docker secret so that the cluster can download the Red Hat AI Inference Server image from the container registry. For example, to create a `Secret` CR that contains the contents of your local `~/.docker/config.json` file, run the following command:

```
$ oc create secret generic docker-secret --from-file=.dockercfg=$HOME/.docker/config.json --type=kubernetes.io/dockercfg -n rhaiis-namespace
```

Create a `PersistentVolumeClaim` (PVC) custom resource (CR) and apply it in the cluster. The following example `PVC` CR uses a default IBM VPC Block persistent volume.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: rhaiis-namespace
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: ibmc-vpc-block-10iops-tier
```

Note: Configuring cluster storage to meet your requirements is outside the scope of this procedure. For more detailed information, see Configuring persistent storage.
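When choosing the PVC `storage` request, a rough rule of thumb is the size of the model weights plus headroom for metadata and future updates. The following is an illustrative calculation only; the parameter count, bytes per parameter, and headroom factor are all assumptions, not values from this procedure:

```python
def pvc_size_gib(num_params_b: float, bytes_per_param: int = 2, headroom: float = 1.5) -> float:
    """Rough PVC sizing: weight bytes in GiB times a headroom factor.
    Assumes 16-bit weights and 50% headroom by default (illustrative)."""
    weights_gib = num_params_b * 1e9 * bytes_per_param / 2**30
    return weights_gib * headroom

# Hypothetical: a 2B-parameter model in 16-bit precision
print(round(pvc_size_gib(2), 1))
```

A 20Gi claim, as in the example PVC, comfortably covers small models of this class and leaves room to cache a larger replacement later.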
Create a `Deployment` custom resource (CR) that pulls the modelcar image and deploys the Red Hat AI Inference Server container. Reference the following example `Deployment` CR, which uses AI Inference Server to serve a modelcar image.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rhaiis-oci-deploy
  namespace: rhaiis-namespace
  labels:
    app: granite
spec:
  replicas: 0
  selector:
    matchLabels:
      app: rhaiis-oci-deploy
  template:
    metadata:
      labels:
        app: rhaiis-oci-deploy
    spec:
      imagePullSecrets:
        - name: docker-secret
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: model-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "2Gi"
        - name: oci-auth
          secret:
            secretName: docker-secret
            items:
              - key: .dockercfg
                path: config.json
      initContainers:
        - name: fetch-model
          image: ghcr.io/oras-project/oras:v1.2.0
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -e
              # Only pull if /model is empty
              if [ -z "$(ls -A /model)" ]; then
                echo "Pulling model…"
                # Update with the modelcar container image registry URL
                oras pull <YOUR_MODELCAR_REGISTRY_URL> \
                  --output /model
              else
                echo "Model already present, skipping pull"
              fi
          volumeMounts:
            - name: model-volume
              mountPath: /model
            - name: oci-auth
              mountPath: /auth
              readOnly: true
      containers:
        - name: granite
          image: 'registry.redhat.io/rhaiis/vllm-cuda-rhel9@sha256:a6645a8e8d7928dce59542c362caf11eca94bb1b427390e78f0f8a87912041cd'
          imagePullPolicy: IfNotPresent
          env:
            - name: VLLM_SERVER_DEV_MODE
              value: '1'
          command:
            - python
            - '-m'
            - vllm.entrypoints.openai.api_server
          args:
            - '--port=8000'
            - '--model=/model'
            - '--served-model-name=ibm-granite/granite-3.1-2b-instruct'
            - '--tensor-parallel-size=1'
          resources:
            limits:
              cpu: '10'
              nvidia.com/gpu: '1'
              memory: 16Gi
            requests:
              cpu: '2'
              memory: 6Gi
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: model-volume
              mountPath: /model
            - name: shm
              mountPath: /dev/shm
      restartPolicy: Always
```

Where:
- `claimName: model-cache`: Specifies the persistent volume claim name. The value of `spec.template.spec.volumes.persistentVolumeClaim.claimName` must match the name of the `PVC` that you created.
- `initContainers:`: Defines a container that runs before the main application container to download the required modelcar image. The model pull step is skipped if the model directory has already been populated, for example, from a previous deployment.
- `--served-model-name=ibm-granite/granite-3.1-2b-instruct`: Specifies a user-friendly name for the served model. Update this value to match the model that you are deploying.
- `mountPath: /dev/shm`: Mounts the shared memory volume required by the NVIDIA Collective Communications Library (NCCL). Tensor parallel vLLM deployments fail without this volume mount.
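The init container's pull-if-empty guard is what makes the model download idempotent across pod restarts: the `ls -A` check skips the pull whenever the PVC already holds files. The same check can be sketched in Python (the paths and demo are illustrative only):

```python
import os
import tempfile

def needs_pull(model_dir: str) -> bool:
    """Return True when the model directory is empty,
    mirroring the init container's `ls -A` check."""
    return len(os.listdir(model_dir)) == 0

# Illustrative demo against a temporary directory
with tempfile.TemporaryDirectory() as d:
    first = needs_pull(d)                              # empty, so a pull is needed
    open(os.path.join(d, "config.json"), "w").close()  # simulate a cached model file
    second = needs_pull(d)                             # populated, so the pull is skipped
print(first, second)
```

Because the check keys off the persistent volume rather than the pod, a rescheduled pod reuses the cached model instead of pulling it again.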
Increase the deployment replica count to the required number. For example, run the following command:

```
$ oc scale deployment rhaiis-oci-deploy -n rhaiis-namespace --replicas=1
```

Optional: Watch the deployment and ensure that it succeeds:

```
$ oc get deployment -n rhaiis-namespace --watch
```

Example output:

```
NAME                READY   UP-TO-DATE   AVAILABLE   AGE
rhaiis-oci-deploy   0/1     1            0           2s
rhaiis-oci-deploy   1/1     1            1           14s
```

Create a `Service` CR for the model inference. For example:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: rhaiis-oci-deploy
  namespace: rhaiis-namespace
spec:
  selector:
    app: rhaiis-oci-deploy
  ports:
    - name: http
      port: 80
      targetPort: 8000
```

Optional: Create a `Route` CR to enable public access to the model. For example:

```yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: rhaiis-oci-deploy
  namespace: rhaiis-namespace
spec:
  to:
    kind: Service
    name: rhaiis-oci-deploy
  port:
    targetPort: http
```

Get the URL for the exposed route. Run the following command:

```
$ oc get route rhaiis-oci-deploy -n rhaiis-namespace -o jsonpath='{.spec.host}'
```

Example output:

```
rhaiis-oci-deploy-rhaiis-namespace.apps.example.com
```
Verification
Ensure that the deployment is successful by querying the model. Run the following command:
```
$ curl -v -k http://rhaiis-oci-deploy-rhaiis-namespace.apps.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ibm-granite/granite-3.1-2b-instruct",
    "messages": [{"role": "user", "content": "Hello?"}],
    "temperature": 0.1
}' | jq
```

Replace the host with the route URL that you retrieved in the previous step.
Example output
```json
{
  "id": "chatcmpl-07b177360eaa40a3b311c24a8e3c7f43",
  "object": "chat.completion",
  "created": 1755189746,
  "model": "ibm-granite/granite-3.1-2b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "Hello! How can I assist you today?",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 61,
    "total_tokens": 71,
    "completion_tokens": 10,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}
```