Inference serving language models in OCI-compliant model containers


Red Hat AI Inference Server 3.3

Inferencing OCI-compliant models in Red Hat AI Inference Server

Red Hat AI Documentation Team

Abstract

Move language models from a local or public registry to OpenShift clusters through a fully supported, GPU-accelerated path by using OCI container mounts.

Chapter 1. About OCI-compliant model containers

You can serve OCI-compliant models for inference with Red Hat AI Inference Server. Storing models in OCI-compliant model containers (or modelcars) is an alternative to S3 or URI-based storage for language models. OCI model images let you distribute models through container registries by using the same versioning, caching, security, and distribution infrastructure that you already have for containers.

Modelcar containers provide faster startup times by avoiding repeated downloads, lower disk usage, and better performance with pre-fetched images. You can store modelcar containers in standard container registries alongside application containers, enabling unified model versioning and distribution workflows.

Before you can deploy a language model in a modelcar container, you must package the model in an OCI container image and then deploy that image in the cluster.

Chapter 2. Creating a modelcar container image

You can create a modelcar image that contains a language model and deploy the image with Red Hat AI Inference Server.

To create a modelcar image, download the model from Hugging Face, package it into a container image, and push the modelcar container image to an image registry.

Prerequisites

  • You have installed Python 3.11 or later.
  • You have installed Podman or Docker.
  • You have access to the internet to download models from Hugging Face.
  • You have configured a container image registry that you can push images to and have logged in.

Procedure

  1. Create a Python virtual environment and install the huggingface_hub Python library:

    python3 -m venv venv && \
    source venv/bin/activate && \
    pip install --upgrade pip && \
    pip install huggingface_hub
  2. Create a model downloader Python script:

    vi download_model.py
  3. Add the following content to the download_model.py file, adjusting the value for model_repo as required:

    from huggingface_hub import snapshot_download
    
    # Specify the Hugging Face repository containing the model
    model_repo = "ibm-granite/granite-3.1-2b-instruct"
    snapshot_download(
        repo_id=model_repo,
        local_dir="/models",
        allow_patterns=["*.safetensors", "*.json", "*.txt"],
    )
  4. Create a Dockerfile for the modelcar:

    FROM registry.access.redhat.com/ubi9/python-311:latest as base
    
    USER root
    
    RUN pip install huggingface-hub
    
    # Download the model file from Hugging Face
    COPY download_model.py .
    
    RUN python download_model.py
    
    # Final image containing only the essential model files
    FROM registry.access.redhat.com/ubi9/ubi-micro:9.4
    
    # Copy the model files from the base container
    COPY --from=base /models /models
    
    USER 1001
  5. Build the modelcar image:

    $ podman build . -t modelcar-example:latest --platform linux/amd64

    Example output

    Successfully tagged localhost/modelcar-example:latest

  6. Push the modelcar image to the container registry. For example:

    $ podman push modelcar-example:latest quay.io/<your_model_registry>/modelcar-example:latest

    Example output

    Getting image source signatures
    Copying blob b2ed7134f853 done
    Copying config 4afd393610 done
    Writing manifest to image destination
    Storing signatures
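As an optional check, you can verify that the snapshot download produced the files the modelcar needs before or after building the image. The following Python sketch is illustrative; the helper name and pattern list are assumptions that mirror the allow_patterns used in download_model.py:

```python
from pathlib import Path

# Illustrative helper (not part of huggingface_hub): verify that a
# downloaded snapshot contains the files a modelcar image needs.
# The patterns mirror the allow_patterns used in download_model.py.
REQUIRED_PATTERNS = ["*.safetensors", "config.json", "tokenizer*.json"]

def missing_model_files(model_dir, patterns=REQUIRED_PATTERNS):
    """Return the patterns that match no file under model_dir."""
    root = Path(model_dir)
    return [p for p in patterns if not list(root.glob(p))]

if __name__ == "__main__":
    target = Path("/models")
    if target.is_dir():
        missing = missing_model_files(target)
        print("missing:", missing or "none")
```

An empty result means every expected pattern matched at least one file in the snapshot directory.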

Chapter 3. Serving a modelcar container with Podman

Serve a large language model stored in a modelcar container by using Podman and Red Hat AI Inference Server running on NVIDIA CUDA AI accelerators. Modelcar containers provide an OCI-compliant method for packaging and distributing language models as container images.

Prerequisites

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.
  • You have created a modelcar container image containing the language model you want to serve and pushed it to a container image registry that you have access to.
  • You have access to a Linux server with data center grade NVIDIA AI accelerators installed.

Note

For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.

Procedure

  1. Open a terminal on your server host, and log in to registry.redhat.io:

    $ podman login registry.redhat.io
  2. Optional: Log in to the container registry where your modelcar container image is stored. For example:

    $ podman login quay.io
  3. Pull the relevant NVIDIA CUDA image by running the following command:

    $ podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0
  4. If your system has SELinux enabled, configure SELinux to allow device access:

    $ sudo setsebool -P container_use_devices 1
  5. Create a folder that you will later mount as a volume in the container, and adjust the folder permissions so that the container can use it:

    $ mkdir -p rhaiis-cache
    $ chmod g+rwX rhaiis-cache
  6. Start the AI Inference Server container image. Run the following commands:

    1. For NVIDIA CUDA accelerators, if the host system has multiple GPUs and uses NVSwitch, start NVIDIA Fabric Manager. To detect whether your system uses NVSwitch, check for files in /proc/driver/nvidia-nvswitch/devices/. Starting NVIDIA Fabric Manager requires root privileges.

      $ ls /proc/driver/nvidia-nvswitch/devices/

      Example output

      0000:0c:09.0  0000:0c:0a.0  0000:0c:0b.0  0000:0c:0c.0  0000:0c:0d.0  0000:0c:0e.0

      $ systemctl start nvidia-fabricmanager
      Important

      NVIDIA Fabric Manager is only required on systems with multiple GPUs that use NVSwitch. For more information, see NVIDIA Server Architectures.

    2. Check that the Red Hat AI Inference Server container can access NVIDIA GPUs on the host by running the following command:

        $ podman run --rm -it \
        --security-opt=label=disable \
        --device nvidia.com/gpu=all \
        nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
        nvidia-smi

        Example output

        +-----------------------------------------------------------------------------------------+
        | NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
        |-----------------------------------------+------------------------+----------------------+
        | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
        | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
        |                                         |                        |               MIG M. |
        |=========================================+========================+======================|
        |   0  NVIDIA A100-SXM4-80GB          Off |   00000000:08:01.0 Off |                    0 |
        | N/A   32C    P0             64W /  400W |       1MiB /  81920MiB |      0%      Default |
        |                                         |                        |             Disabled |
        +-----------------------------------------+------------------------+----------------------+
        |   1  NVIDIA A100-SXM4-80GB          Off |   00000000:08:02.0 Off |                    0 |
        | N/A   29C    P0             63W /  400W |       1MiB /  81920MiB |      0%      Default |
        |                                         |                        |             Disabled |
        +-----------------------------------------+------------------------+----------------------+
        
        +-----------------------------------------------------------------------------------------+
        | Processes:                                                                              |
        |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
        |        ID   ID                                                               Usage      |
        |=========================================================================================|
        |  No running processes found                                                             |
        +-----------------------------------------------------------------------------------------+

    3. Start the AI Inference Server container with the modelcar container image mounted:

        $ podman run --rm -it \
          --device nvidia.com/gpu=all \
          --security-opt=label=disable \
          --shm-size=4g \
          --userns=keep-id:uid=1001 \
          -p 8000:8000 \
          -e HF_HUB_OFFLINE=1 \
          -e TRANSFORMERS_OFFLINE=1 \
          --mount type=image,source=registry.redhat.io/rhelai1/modelcar-granite-8b-code-instruct:1.4-1739210683,destination=/model \
          -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
          registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
          --model /model/models \
          --port 8000 \
          --dtype auto \
          --max-model-len 4096 \
          --tensor-parallel-size 2 \
          --served-model-name rhelai1/modelcar-granite-8b-code-instruct

        Where:

        --security-opt=label=disable
        Disables SELinux label relabeling for volume mounts. Required for systems where SELinux is enabled. Without this option, the container might fail to start.
        --shm-size=4g
        Specifies the shared memory size. Increase to 8GB if you experience shared memory issues.
        --userns=keep-id:uid=1001
        Maps the host UID to the effective UID of the vLLM process in the container. Alternatively, you can pass --user=0, but this is less secure because it runs vLLM as root inside the container.
        -e HF_HUB_OFFLINE=1
        Prevents Hugging Face Hub from connecting to the internet.
        -e TRANSFORMERS_OFFLINE=1
        Configures the Transformers library to use only the locally mounted model.
        --mount type=image,source=registry.redhat.io/rhelai1/modelcar-granite-8b-code-instruct:1.4-1739210683,destination=/model
        Mounts the modelcar container directly inside the running rhaiis/vllm-cuda-rhel9 Red Hat AI Inference Server container.
        -v ./rhaiis-cache:/opt/app-root/src/.cache:Z
        Mounts the cache directory with SELinux context. The :Z suffix is required for systems where SELinux is enabled. On Debian, Ubuntu, or Docker without SELinux, omit the :Z suffix.
        --model /model/models
        Specifies the path to the model directory inside the container.
        --tensor-parallel-size 2
        Specifies the number of GPUs to use for tensor parallelism. Set this value to match the number of available GPUs.
        --served-model-name rhelai1/modelcar-granite-8b-code-instruct
        Specifies a user-friendly name for the served model. If not set, the name defaults to the value of the --model parameter.
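The server flags described above can be kept in one place and rendered into an argument list, which helps when the same model is started from several scripts. A minimal sketch; the helper and the settings dict are local conventions, not part of vLLM:

```python
# Illustrative sketch: render the vLLM server flags used in the podman
# command above from a single settings dict. The helper is a local
# convention, not a vLLM API.
def vllm_args(settings):
    return [
        "--model", settings["model_path"],
        "--port", str(settings.get("port", 8000)),
        "--dtype", settings.get("dtype", "auto"),
        "--max-model-len", str(settings["max_model_len"]),
        "--tensor-parallel-size", str(settings["gpus"]),
        "--served-model-name", settings["served_name"],
    ]

example = {
    "model_path": "/model/models",
    "max_model_len": 4096,
    "gpus": 2,
    "served_name": "rhelai1/modelcar-granite-8b-code-instruct",
}
print(" ".join(vllm_args(example)))
```

Keeping the settings in one dict makes it harder for values such as the tensor parallel size and the GPU count to drift apart between environments.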

Verification

  • In a separate tab in your terminal, make a request to the model API:

    $ curl -X POST -H "Content-Type: application/json" -d '{
        "prompt": "What is the capital city of Ireland?",
        "model": "rhelai1/modelcar-granite-8b-code-instruct",
        "max_tokens": 50
    }' http://localhost:8000/v1/completions | jq

    Example output

    {
      "id": "cmpl-0c3d8d1ac21642529c073453bbb34b01",
      "object": "text_completion",
      "created": 1760362458,
      "model": "rhelai1/modelcar-granite-8b-code-instruct",
      "choices": [
        {
          "index": 0,
          "text": "\nAnswer: Dublin",
          "logprobs": null,
          "finish_reason": "stop",
          "stop_reason": null,
          "token_ids": null,
          "prompt_logprobs": null,
          "prompt_token_ids": null
        }
      ],
      "service_tier": null,
      "system_fingerprint": null,
      "usage": {
        "prompt_tokens": 10,
        "total_tokens": 17,
        "completion_tokens": 7,
        "prompt_tokens_details": null
      },
      "kv_transfer_params": null
    }
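The same completion request can be made from Python with only the standard library. A minimal sketch; it assumes the server from the previous step is listening on localhost:8000, and the helper names are illustrative:

```python
import json
import urllib.request

def completion_payload(prompt, model, max_tokens=50):
    """Build the request body for the /v1/completions endpoint."""
    return {"prompt": prompt, "model": model, "max_tokens": max_tokens}

def post_completion(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    body = completion_payload(
        "What is the capital city of Ireland?",
        "rhelai1/modelcar-granite-8b-code-instruct",
    )
    try:
        result = post_completion("http://localhost:8000/v1/completions", body)
        print(result["choices"][0]["text"])
    except OSError as err:  # urllib network errors subclass OSError
        print(f"server not reachable: {err}")
```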

Chapter 4. Deploying a modelcar container in OpenShift Container Platform

Deploy a language model in a modelcar container on OpenShift Container Platform by configuring secrets, persistent storage, and a Deployment custom resource (CR) that uses Red Hat AI Inference Server to serve the modelcar container image.

Prerequisites

  • You have installed the OpenShift CLI (oc).
  • You have logged in as a user with cluster-admin privileges.
  • You have installed NFD and the required GPU Operator for your underlying AI accelerator hardware.
  • You have created a modelcar container image for the language model and pushed it to a container image registry.

Procedure

  1. Create the Docker secret so that the cluster can download the Red Hat AI Inference Server image from the container registry. For example, to create a Secret CR that contains the contents of your local ~/.docker/config.json file, run the following command:

    $ oc create secret generic docker-secret --from-file=.dockercfg=$HOME/.docker/config.json --type=kubernetes.io/dockercfg -n rhaiis-namespace
  2. Create a PersistentVolumeClaim (PVC) custom resource (CR) and apply it in the cluster. The following example PVC CR uses an IBM VPC Block persistent volume storage class.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: model-cache
      namespace: rhaiis-namespace
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      storageClassName: ibmc-vpc-block-10iops-tier
    Note

    Configuring cluster storage to meet your requirements is outside the scope of this procedure. For more detailed information, see Configuring persistent storage.

  3. Create a Deployment custom resource (CR) that pulls the modelcar image and deploys the Red Hat AI Inference Server container. Reference the following example Deployment CR, which uses AI Inference Server to serve a modelcar image.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: rhaiis-oci-deploy
      namespace: rhaiis-namespace
      labels:
        app: granite
    spec:
      replicas: 0
      selector:
        matchLabels:
          app: rhaiis-oci-deploy
      template:
        metadata:
          labels:
            app: rhaiis-oci-deploy
        spec:
          imagePullSecrets:
            - name: docker-secret
          volumes:
            - name: model-volume
              persistentVolumeClaim:
                claimName: model-cache
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "2Gi"
            - name: oci-auth
              secret:
                secretName: docker-secret
                items:
                  - key: .dockercfg
                    path: config.json
          initContainers:
            - name: fetch-model
              image: ghcr.io/oras-project/oras:v1.2.0
              command: ["/bin/sh","-c"]
              args:
                - |
                  set -e
                  # Only pull if /model is empty
                  if [ -z "$(ls -A /model)" ]; then
                    echo "Pulling model…"
                    # Update with the modelcar container image registry URL
                    oras pull <YOUR_MODELCAR_REGISTRY_URL> \
                      --registry-config /auth/config.json \
                      --output /model
                  else
                    echo "Model already present, skipping pull"
                  fi
              volumeMounts:
                - name: model-volume
                  mountPath: /model
                - name: oci-auth
                  mountPath: /auth
                  readOnly: true
          containers:
            - name: granite
              image: 'registry.redhat.io/rhaiis/vllm-cuda-rhel9@sha256:a6645a8e8d7928dce59542c362caf11eca94bb1b427390e78f0f8a87912041cd'
              imagePullPolicy: IfNotPresent
              env:
                - name: VLLM_SERVER_DEV_MODE
                  value: '1'
              command:
                - python
                - '-m'
                - vllm.entrypoints.openai.api_server
              args:
                - '--port=8000'
                - '--model=/model'
                - '--served-model-name=ibm-granite/granite-3.1-2b-instruct'
                - '--tensor-parallel-size=1'
              resources:
                limits:
                  cpu: '10'
                  nvidia.com/gpu: '1'
                  memory: 16Gi
                requests:
                  cpu: '2'
                  memory: 6Gi
                  nvidia.com/gpu: '1'
              volumeMounts:
                - name: model-volume
                  mountPath: /model
                - name: shm
                  mountPath: /dev/shm
          restartPolicy: Always

    Where:

    claimName: model-cache
    Specifies the persistent volume claim name. The value of spec.template.spec.volumes.persistentVolumeClaim.claimName must match the name of the PVC that you created.
    initContainers:
    Defines a container that runs before the main application container to download the required modelcar image. The model pull step is skipped if the model directory has already been populated, for example, from a previous deployment.
    --served-model-name=ibm-granite/granite-3.1-2b-instruct
    Specifies a user-friendly name for the served model. Update this value to match the model that you are deploying.
    mountPath: /dev/shm
    Mounts the shared memory volume required by the NVIDIA Collective Communications Library (NCCL). Tensor parallel vLLM deployments fail without this volume mount.
  4. Increase the deployment replica count to the required number. For example, run the following command:

    $ oc scale deployment rhaiis-oci-deploy -n rhaiis-namespace --replicas=1
  5. Optional: Watch the deployment and ensure that it succeeds:

    $ oc get deployment -n rhaiis-namespace --watch

    Example output

    NAME                READY   UP-TO-DATE   AVAILABLE   AGE
    rhaiis-oci-deploy   0/1     1            0           2s
    rhaiis-oci-deploy   1/1     1            1           14s

  6. Create a Service CR for the model inference. For example:

    apiVersion: v1
    kind: Service
    metadata:
      name: rhaiis-oci-deploy
      namespace: rhaiis-namespace
    spec:
      selector:
        app: rhaiis-oci-deploy
      ports:
        - name: http
          port: 80
          targetPort: 8000
  7. Optional: Create a Route CR to enable public access to the model. For example:

    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      name: rhaiis-oci-deploy
      namespace: rhaiis-namespace
    spec:
      to:
        kind: Service
        name: rhaiis-oci-deploy
      port:
        targetPort: http
  8. Get the URL for the exposed route. Run the following command:

    $ oc get route rhaiis-oci-deploy -n rhaiis-namespace -o jsonpath='{.spec.host}'

    Example output

    rhaiis-oci-deploy-rhaiis-namespace.apps.example.com
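After scaling up the deployment, the server can take a while to load the model. The following sketch polls the route until the server responds; it assumes the OpenAI-compatible /v1/models endpoint that vLLM exposes, and the helper names are illustrative:

```python
import json
import time
import urllib.request

def models_url(route_host, scheme="http"):
    """Build the OpenAI-compatible model listing URL for a route host."""
    return f"{scheme}://{route_host}/v1/models"

def wait_for_model(route_host, timeout_s=300.0):
    """Poll /v1/models until it responds, then return the served model IDs."""
    deadline = time.monotonic() + timeout_s
    url = models_url(route_host)
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                data = json.load(resp)
            return [m["id"] for m in data.get("data", [])]
        except OSError:  # connection refused, DNS failure, timeout
            if time.monotonic() > deadline:
                raise
            time.sleep(5)

if __name__ == "__main__":
    host = "rhaiis-oci-deploy-rhaiis-namespace.apps.example.com"
    print(models_url(host))
    # print(wait_for_model(host))  # uncomment once the route is live
```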

Verification

Ensure that the deployment is successful by querying the model. Run the following command:

$ curl -s -k http://rhaiis-oci-deploy-rhaiis-namespace.apps.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ibm-granite/granite-3.1-2b-instruct",
    "messages": [{"role": "user", "content": "Hello?"}],
    "temperature": 0.1
  }' | jq

Example output

{
  "id": "chatcmpl-07b177360eaa40a3b311c24a8e3c7f43",
  "object": "chat.completion",
  "created": 1755189746,
  "model": "ibm-granite/granite-3.1-2b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "Hello! How can I assist you today?",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 61,
    "total_tokens": 71,
    "completion_tokens": 10,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}
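For reference, the verification request can also be scripted in Python with only the standard library. The route hostname below is the example value from the earlier oc get route output; the helper names are illustrative:

```python
import json
import urllib.request

def chat_payload(user_message, model, temperature=0.1):
    """Build the request body for the /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }

def post_chat(route_host, payload):
    """POST a chat request to the route and return the decoded response."""
    req = urllib.request.Request(
        f"http://{route_host}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    host = "rhaiis-oci-deploy-rhaiis-namespace.apps.example.com"
    body = chat_payload("Hello?", "ibm-granite/granite-3.1-2b-instruct")
    try:
        reply = post_chat(host, body)
        print(reply["choices"][0]["message"]["content"])
    except OSError as err:  # urllib network errors subclass OSError
        print(f"route not reachable: {err}")
```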

Legal Notice

Copyright © Red Hat.
Except as otherwise noted below, the text of and illustrations in this documentation are licensed by Red Hat under the Creative Commons Attribution-Share Alike 3.0 Unported license. If you distribute this document or an adaptation of it, you must provide the URL for the original version.
Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.
Red Hat, the Red Hat logo, JBoss, Hibernate, and RHCE are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the United States and other countries.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
XFS is a trademark or registered trademark of Hewlett Packard Enterprise Development LP or its subsidiaries in the United States and other countries.
The OpenStack® Word Mark and OpenStack logo are trademarks or registered trademarks of the Linux Foundation, used under license.
All other trademarks are the property of their respective owners.