Chapter 2. Configuring model servers on the single-model serving platform


On the single-model serving platform, you configure model servers by using model-serving runtimes. A model-serving runtime adds support for a specified set of model frameworks and the model formats that they support.

2.1. About the single-model serving platform

For deploying large models such as large language models (LLMs), OpenShift AI includes a single-model serving platform that is based on the KServe component. Because each model is deployed on its own model server, the single-model serving platform helps you to deploy, monitor, scale, and maintain large models that require increased resources.

2.1.1. Components

  • KServe: A Kubernetes custom resource definition (CRD) that orchestrates model serving for all types of models. KServe includes model-serving runtimes that implement the loading of given types of models. KServe also handles the lifecycle of the deployment object, storage access, and networking setup.
  • Red Hat OpenShift Serverless: A cloud-native development model that allows for serverless deployments of models. OpenShift Serverless is based on the open source Knative project.
  • Red Hat OpenShift Service Mesh: A service mesh networking layer that manages traffic flows and enforces access policies. OpenShift Service Mesh is based on the open source Istio project.

2.1.2. Installation options

To install the single-model serving platform, you have the following options:

Automated installation

If you have not already created a ServiceMeshControlPlane or KNativeServing resource on your OpenShift cluster, you can configure the Red Hat OpenShift AI Operator to install KServe and configure its dependencies.

For more information about automated installation, see Configuring automated installation of KServe.

Manual installation

If you have already created a ServiceMeshControlPlane or KNativeServing resource on your OpenShift cluster, you cannot configure the Red Hat OpenShift AI Operator to install KServe and configure its dependencies. In this situation, you must install KServe manually.

For more information about manual installation, see Manually installing KServe.

2.1.3. Authorization

You can add Authorino as an authorization provider for the single-model serving platform. Adding an authorization provider allows you to enable token authentication for models that you deploy on the platform, which ensures that only authorized parties can make inference requests to the models.

To add Authorino as an authorization provider on the single-model serving platform, you have the following options:

  • If automated installation of the single-model serving platform is possible on your cluster, you can include Authorino as part of the automated installation process.
  • If you need to manually install the single-model serving platform, you must also manually configure Authorino.

For guidance on choosing an installation option for the single-model serving platform, see Installation options.

2.1.4. Monitoring

You can configure monitoring for the single-model serving platform and use Prometheus to scrape metrics for each of the pre-installed model-serving runtimes.
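
Metrics collection relies on the standard OpenShift user workload monitoring stack. As a sketch, enabling user workload monitoring typically involves a ConfigMap similar to the following (resource and namespace names per the OpenShift monitoring documentation; verify against your cluster before applying):

```yaml
# Enables monitoring for user-defined projects, which allows Prometheus
# to scrape metrics from the model-serving runtimes.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```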

2.2. Enabling the single-model serving platform

When you have installed KServe, you can use the Red Hat OpenShift AI dashboard to enable the single-model serving platform. You can also use the dashboard to enable model-serving runtimes for the platform.

Prerequisites

  • You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
  • You have installed KServe.
  • The spec.dashboardConfig.disableKServe dashboard configuration option is set to false (the default).

    For more information about setting dashboard configuration options, see Customizing the dashboard.
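
The dashboard option lives in the OdhDashboardConfig custom resource. The following is an illustrative fragment only; the resource name and namespace may differ in your installation:

```yaml
apiVersion: opendatahub.io/v1alpha
kind: OdhDashboardConfig
metadata:
  name: odh-dashboard-config
  namespace: redhat-ods-applications
spec:
  dashboardConfig:
    disableKServe: false   # false (the default) keeps KServe enabled in the dashboard
```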

Procedure

  1. Enable the single-model serving platform as follows:

    1. In the left menu, click Settings → Cluster settings.
    2. Locate the Model serving platforms section.
    3. To enable the single-model serving platform for projects, select the Single-model serving platform checkbox.
    4. Select KServe RawDeployment or Knative Serverless deployment mode.

      For more information about these deployment mode options, see About KServe deployment modes.

    5. Click Save changes.
  2. Enable preinstalled runtimes for the single-model serving platform as follows:

    1. In the left menu of the OpenShift AI dashboard, click Settings → Serving runtimes.

      The Serving runtimes page shows preinstalled runtimes and any custom runtimes that you have added.

      For more information about preinstalled runtimes, see Supported runtimes.

    2. Set the runtime that you want to use to Enabled.

      The single-model serving platform is now available for model deployments.

2.3. Customizing the vLLM model-serving runtime

You can configure the vLLM NVIDIA GPU ServingRuntime for KServe runtime to use speculative decoding, a parallel processing technique that optimizes inferencing time for large language models (LLMs).

You can also configure the runtime to support inferencing for vision-language models (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.

The following procedure describes how to customize the vLLM NVIDIA GPU ServingRuntime for KServe runtime for speculative decoding and multi-modal inferencing.

Prerequisites

  • You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
  • If you are using the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.

Procedure

  1. Follow the steps to deploy a model as described in Deploying models on the single-model serving platform.
  2. In the Serving runtime field, select the vLLM NVIDIA GPU ServingRuntime for KServe runtime.
  3. To configure the vLLM model-serving runtime for speculative decoding by matching n-grams in the prompt, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:

    --speculative-model=[ngram]
    --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
    --ngram-prompt-lookup-max=<NGRAM_PROMPT_LOOKUP_MAX>
    --use-v2-block-manager
    1. Replace <NUM_SPECULATIVE_TOKENS> and <NGRAM_PROMPT_LOOKUP_MAX> with your own values.

      Note

      Inferencing throughput varies depending on the model used for speculating with n-grams.
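
For illustration, the n-gram arguments might be filled in as follows. The numeric values here are arbitrary examples, not recommendations; tune them for your model and workload:

```text
--speculative-model=[ngram]
--num-speculative-tokens=5
--ngram-prompt-lookup-max=4
--use-v2-block-manager
```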

  4. To configure the vLLM model-serving runtime for speculative decoding with a draft model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:

    --port=8080
    --served-model-name={{.Name}}
    --distributed-executor-backend=mp
    --model=/mnt/models/<path_to_original_model>
    --speculative-model=/mnt/models/<path_to_speculative_model>
    --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
    --use-v2-block-manager
    1. Replace <path_to_speculative_model> and <path_to_original_model> with the paths to the speculative model and original model on your S3-compatible object storage.
    2. Replace <NUM_SPECULATIVE_TOKENS> with your own value.
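
As an illustrative example, with hypothetical model paths and an example token count filled in (the folder names here are placeholders, not real models):

```text
--port=8080
--served-model-name={{.Name}}
--distributed-executor-backend=mp
--model=/mnt/models/base-model
--speculative-model=/mnt/models/draft-model
--num-speculative-tokens=5
--use-v2-block-manager
```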
  5. To configure the vLLM model-serving runtime for multi-modal inferencing, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:

    --trust-remote-code
    Note

    Only use the --trust-remote-code argument with models from trusted sources.

  6. Click Deploy.

Verification

  • If you have configured the vLLM model-serving runtime for speculative decoding, use the following example command to verify API requests to your deployed model:

    curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <token>"
  • If you have configured the vLLM model-serving runtime for multi-modal inferencing, use the following example command to verify API requests to the vision-language model (VLM) that you have deployed:

    curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <token>" \
    -d '{"model":"<model_name>",
         "messages":
            [{"role":"<role>",
              "content":
                 [{"type":"text", "text":"<text>"
                  },
                  {"type":"image_url", "image_url":{"url":"<image_url_link>"}
                  }
                 ]
             }
            ]
        }'
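
The speculative-decoding verification command above omits a request body. The following sketch builds a minimal chat-completions payload and checks that it is well-formed JSON before sending; the model name, endpoint URL, and token are placeholders that you must replace with your own values:

```shell
# Build a minimal chat-completions payload (hypothetical model name).
PAYLOAD='{"model":"<model_name>","messages":[{"role":"user","content":"Hello"}]}'

# Check that the payload is valid JSON before sending it to the server.
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload OK"

# Send the request (replace the endpoint URL and token with your own values):
# curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -H "Authorization: Bearer <token>" \
#   -d "$PAYLOAD"
```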

2.4. Adding a custom model-serving runtime

A model-serving runtime adds support for a specified set of model frameworks and the model formats supported by those frameworks. You can use the preinstalled runtimes that are included with OpenShift AI, or add your own custom runtimes if the default runtimes do not meet your needs.

As an administrator, you can use the OpenShift AI interface to add and enable a custom model-serving runtime. You can then choose the custom runtime when you deploy a model on the single-model serving platform.

Note

Red Hat does not provide support for custom runtimes. You are responsible for ensuring that you are licensed to use any custom runtimes that you add, and for correctly configuring and maintaining them.

Prerequisites

  • You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
  • You have built your custom runtime and added the image to a container image repository such as Quay.

Procedure

  1. From the OpenShift AI dashboard, click Settings → Serving runtimes.

    The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.

  2. To add a custom runtime, choose one of the following options:

    • To start with an existing runtime (for example, vLLM NVIDIA GPU ServingRuntime for KServe), click the action menu (⋮) next to the existing runtime and then click Duplicate.
    • To add a new custom runtime, click Add serving runtime.
  3. In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
  4. In the Select the API protocol this runtime supports list, select REST or gRPC.
  5. Optional: If you are creating a new runtime (rather than duplicating an existing one), add your code by choosing one of the following options:

    • Upload a YAML file

      1. Click Upload files.
      2. In the file browser, select a YAML file on your computer.

        The embedded YAML editor opens and shows the contents of the file that you uploaded.

    • Enter YAML code directly in the editor

      1. Click Start from scratch.
      2. Enter or paste YAML code directly in the embedded editor.
    Note

    In many cases, creating a custom runtime will require adding new or custom parameters to the env section of the ServingRuntime specification.
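
For example, a custom runtime that needs extra environment configuration might include an env section like the following hedged sketch. The image reference and variable names are illustrative placeholders, not taken from any particular runtime:

```yaml
spec:
  containers:
    - name: kserve-container
      image: quay.io/<your_org>/<your_runtime_image>:<tag>
      env:
        - name: MODEL_CACHE_DIR   # illustrative custom parameter
          value: /tmp/model-cache
        - name: LOG_LEVEL         # illustrative custom parameter
          value: info
```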

  6. Click Add.

    The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the custom runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.

  7. Optional: To edit your custom runtime, click the action menu (⋮) and select Edit.

Verification

  • The custom model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.

2.5. Adding tested and verified model-serving runtimes

In addition to preinstalled and custom model-serving runtimes, you can use Red Hat tested and verified model-serving runtimes to support your requirements. For more information about Red Hat tested and verified runtimes, see Tested and verified runtimes for Red Hat OpenShift AI.

You can use the Red Hat OpenShift AI dashboard to add and enable tested and verified runtimes for the single-model serving platform. You can then choose the runtime when you deploy a model on the single-model serving platform.

Prerequisites

  • You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.

Procedure

  1. From the OpenShift AI dashboard, click Settings → Serving runtimes.

    The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.

  2. Click Add serving runtime.
  3. In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
  4. In the Select the API protocol this runtime supports list, select REST or gRPC.
  5. Click Start from scratch.
  6. Follow these steps to add the IBM Power Accelerated for NVIDIA Triton Inference Server runtime:

    1. If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: triton-ppc64le-runtime
        annotations:
          openshift.io/display-name: Triton Server ServingRuntime for KServe(ppc64le)
      spec:
        supportedModelFormats:
          - name: FIL
            version: "1"
            autoSelect: true
          - name: python
            version: "1"
            autoSelect: true
          - name: onnx
            version: "1"
            autoSelect: true
          - name: pytorch
            version: "1"
            autoSelect: true
        multiModel: false
        containers:
          - command:
              - tritonserver
              - --model-repository=/mnt/models
            name: kserve-container
            image: quay.io/powercloud/tritonserver:latest
            resources:
              requests:
                cpu: 2
                memory: 8Gi
              limits:
                cpu: 2
                memory: 8Gi
            ports:
              - containerPort: 8000
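
After this runtime is deployed, Triton exposes the standard KServe v2 health endpoints over REST, which you can use to check the server from outside the cluster:

```text
GET /v2/health/live    # server process is running
GET /v2/health/ready   # server is ready to serve inference requests
```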
  7. Follow these steps to add the IBM Z Accelerated for NVIDIA Triton Inference Server runtime:

    1. If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: ibmz-triton-rest
        labels:
          opendatahub.io/dashboard: "true"
      spec:
        containers:
          - name: kserve-container
            command:
              - /bin/sh
              - -c
            args:
              - /opt/tritonserver/bin/tritonserver --model-repository=/mnt/models --http-port=8000 --grpc-port=8001 --metrics-port=8002
            image: icr.io/ibmz/ibmz-accelerated-for-nvidia-triton-inference-server:<version>
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop:
                  - ALL
              runAsNonRoot: true
              seccompProfile:
                type: RuntimeDefault
            resources:
              limits:
                cpu: "2"
                memory: 4Gi
              requests:
                cpu: "2"
                memory: 4Gi
            ports:
              - containerPort: 8000
                protocol: TCP
        protocolVersions:
          - v2
          - grpc-v2
        supportedModelFormats:
          - name: onnx-mlir
            version: "1"
            autoSelect: true
          - name: snapml
            version: "1"
            autoSelect: true
          - name: pytorch
            version: "1"
            autoSelect: true
    2. If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: ibmz-triton-grpc
        labels:
          opendatahub.io/dashboard: "true"
      spec:
        containers:
          - name: kserve-container
            command:
              - /bin/sh
              - -c
            args:
              - /opt/tritonserver/bin/tritonserver --model-repository=/mnt/models --grpc-port=8001 --http-port=8000 --metrics-port=8002
            image: icr.io/ibmz/ibmz-accelerated-for-nvidia-triton-inference-server:<version>
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop:
                  - ALL
              runAsNonRoot: true
              seccompProfile:
                type: RuntimeDefault
            resources:
              limits:
                cpu: "2"
                memory: 4Gi
              requests:
                cpu: "2"
                memory: 4Gi
            ports:
              - containerPort: 8001
                name: grpc
                protocol: TCP
            volumeMounts:
              - mountPath: /dev/shm
                name: shm
        protocolVersions:
          - v2
          - grpc-v2
        supportedModelFormats:
          - name: onnx-mlir
            version: "1"
            autoSelect: true
          - name: snapml
            version: "1"
            autoSelect: true
          - name: pytorch
            version: "1"
            autoSelect: true
        volumes:
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 2Gi
  8. Follow these steps to add the NVIDIA Triton Inference Server runtime:

    1. If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: triton-kserve-rest
        labels:
          opendatahub.io/dashboard: "true"
      spec:
        annotations:
          prometheus.kserve.io/path: /metrics
          prometheus.kserve.io/port: "8002"
        containers:
          - args:
              - tritonserver
              - --model-store=/mnt/models
              - --grpc-port=9000
              - --http-port=8080
              - --allow-grpc=true
              - --allow-http=true
            image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
            name: kserve-container
            resources:
              limits:
                cpu: "1"
                memory: 2Gi
              requests:
                cpu: "1"
                memory: 2Gi
            ports:
              - containerPort: 8080
                protocol: TCP
        protocolVersions:
          - v2
          - grpc-v2
        supportedModelFormats:
          - autoSelect: true
            name: tensorrt
            version: "8"
          - autoSelect: true
            name: tensorflow
            version: "1"
          - autoSelect: true
            name: tensorflow
            version: "2"
          - autoSelect: true
            name: onnx
            version: "1"
          - name: pytorch
            version: "1"
          - autoSelect: true
            name: triton
            version: "2"
          - autoSelect: true
            name: xgboost
            version: "1"
          - autoSelect: true
            name: python
            version: "1"
    2. If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: triton-kserve-grpc
        labels:
          opendatahub.io/dashboard: "true"
      spec:
        annotations:
          prometheus.kserve.io/path: /metrics
          prometheus.kserve.io/port: "8002"
        containers:
          - args:
              - tritonserver
              - --model-store=/mnt/models
              - --grpc-port=9000
              - --http-port=8080
              - --allow-grpc=true
              - --allow-http=true
            image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
            name: kserve-container
            ports:
              - containerPort: 9000
                name: h2c
                protocol: TCP
            volumeMounts:
              - mountPath: /dev/shm
                name: shm
            resources:
              limits:
                cpu: "1"
                memory: 2Gi
              requests:
                cpu: "1"
                memory: 2Gi
        protocolVersions:
          - v2
          - grpc-v2
        supportedModelFormats:
          - autoSelect: true
            name: tensorrt
            version: "8"
          - autoSelect: true
            name: tensorflow
            version: "1"
          - autoSelect: true
            name: tensorflow
            version: "2"
          - autoSelect: true
            name: onnx
            version: "1"
          - name: pytorch
            version: "1"
          - autoSelect: true
            name: triton
            version: "2"
          - autoSelect: true
            name: xgboost
            version: "1"
          - autoSelect: true
            name: python
            version: "1"
        volumes:
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 2Gi
  9. Follow these steps to add the Seldon MLServer runtime:

    1. If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: mlserver-kserve-rest
        labels:
          opendatahub.io/dashboard: "true"
      spec:
        annotations:
          openshift.io/display-name: Seldon MLServer
          prometheus.kserve.io/port: "8080"
          prometheus.kserve.io/path: /metrics
        containers:
          - name: kserve-container
            image: 'docker.io/seldonio/mlserver@sha256:07890828601515d48c0fb73842aaf197cbcf245a5c855c789e890282b15ce390'
            env:
              - name: MLSERVER_HTTP_PORT
                value: "8080"
              - name: MLSERVER_GRPC_PORT
                value: "9000"
              - name: MODELS_DIR
                value: /mnt/models
            resources:
              requests:
                cpu: "1"
                memory: 2Gi
              limits:
                cpu: "1"
                memory: 2Gi
            ports:
              - containerPort: 8080
                protocol: TCP
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop:
                  - ALL
              privileged: false
              runAsNonRoot: true
        protocolVersions:
          - v2
        multiModel: false
        supportedModelFormats:
          - name: sklearn
            version: "0"
            autoSelect: true
            priority: 2
          - name: sklearn
            version: "1"
            autoSelect: true
            priority: 2
          - name: xgboost
            version: "1"
            autoSelect: true
            priority: 2
          - name: xgboost
            version: "2"
            autoSelect: true
            priority: 2
          - name: lightgbm
            version: "3"
            autoSelect: true
            priority: 2
          - name: lightgbm
            version: "4"
            autoSelect: true
            priority: 2
          - name: mlflow
            version: "1"
            autoSelect: true
            priority: 1
          - name: mlflow
            version: "2"
            autoSelect: true
            priority: 1
          - name: catboost
            version: "1"
            autoSelect: true
            priority: 1
          - name: huggingface
            version: "1"
            autoSelect: true
            priority: 1
    2. If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: mlserver-kserve-grpc
        labels:
          opendatahub.io/dashboard: "true"
      spec:
        annotations:
          openshift.io/display-name: Seldon MLServer
          prometheus.kserve.io/port: "8080"
          prometheus.kserve.io/path: /metrics
        containers:
          - name: kserve-container
            image: 'docker.io/seldonio/mlserver@sha256:07890828601515d48c0fb73842aaf197cbcf245a5c855c789e890282b15ce390'
            env:
              - name: MLSERVER_HTTP_PORT
                value: "8080"
              - name: MLSERVER_GRPC_PORT
                value: "9000"
              - name: MODELS_DIR
                value: /mnt/models
            resources:
              requests:
                cpu: "1"
                memory: 2Gi
              limits:
                cpu: "1"
                memory: 2Gi
            ports:
              - containerPort: 9000
                name: h2c
                protocol: TCP
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop:
                  - ALL
              privileged: false
              runAsNonRoot: true
        protocolVersions:
          - v2
        multiModel: false
        supportedModelFormats:
          - name: sklearn
            version: "0"
            autoSelect: true
            priority: 2
          - name: sklearn
            version: "1"
            autoSelect: true
            priority: 2
          - name: xgboost
            version: "1"
            autoSelect: true
            priority: 2
          - name: xgboost
            version: "2"
            autoSelect: true
            priority: 2
          - name: lightgbm
            version: "3"
            autoSelect: true
            priority: 2
          - name: lightgbm
            version: "4"
            autoSelect: true
            priority: 2
          - name: mlflow
            version: "1"
            autoSelect: true
            priority: 1
          - name: mlflow
            version: "2"
            autoSelect: true
            priority: 1
          - name: catboost
            version: "1"
            autoSelect: true
            priority: 1
          - name: huggingface
            version: "1"
            autoSelect: true
            priority: 1
  10. In the metadata.name field, make sure that the name of the runtime that you are adding does not match the name of a runtime that you have already added.
  11. Optional: To use a custom display name for the runtime that you are adding, add a metadata.annotations.openshift.io/display-name field and specify a value, as shown in the following example:

    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      name: kserve-triton
      annotations:
        openshift.io/display-name: Triton ServingRuntime
    Note

    If you do not configure a custom display name for your runtime, OpenShift AI shows the value of the metadata.name field.

  12. Click Create.

    The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.

  13. Optional: To edit the runtime, click the action menu (⋮) and select Edit.

Verification

  • The model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.