Chapter 2. Configuring model servers


You configure model servers by using model-serving runtimes, which add support for a specified set of model frameworks and the model formats that they support.

2.1. Enabling the model serving platform

When you have installed KServe, you can use the Red Hat OpenShift AI dashboard to enable the model serving platform. You can also use the dashboard to enable model-serving runtimes for the platform.

Prerequisites

  • You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
  • You have installed KServe.

Procedure

  1. Enable the model serving platform as follows:

    1. In the left menu, click Settings → Cluster settings → General settings.
    2. Locate the Model serving platforms section.
    3. To enable the model serving platform for projects, select the Model serving platform checkbox.
    4. Click Save changes.
  2. Enable preinstalled runtimes for the model serving platform as follows:

    1. In the left menu of the OpenShift AI dashboard, click Settings → Model resources and operations → Serving runtimes.

      The Serving runtimes page shows preinstalled runtimes and any custom runtimes that you have added.

      For more information about preinstalled runtimes, see Supported runtimes.

    2. Set the runtime that you want to use to Enabled.

      The model serving platform is now available for model deployments.

2.2. Enabling speculative decoding and multi-modal inferencing

You can configure the vLLM NVIDIA GPU ServingRuntime for KServe runtime to use speculative decoding, a parallel processing technique to optimize inferencing time for large language models (LLMs).

You can also configure the runtime to support inferencing for vision-language models (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.

The following procedure describes customizing the vLLM NVIDIA GPU ServingRuntime for KServe runtime for speculative decoding and multi-modal inferencing.

Prerequisites

  • You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
  • If you are using the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.

Procedure

  1. Follow the steps to deploy a model as described in Deploying models on the model serving platform.
  2. In the Serving runtime field, select the vLLM NVIDIA GPU ServingRuntime for KServe runtime.
  3. To configure the vLLM model-serving runtime for speculative decoding by matching n-grams in the prompt, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:

    --speculative-model=[ngram]
    --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
    --ngram-prompt-lookup-max=<NGRAM_PROMPT_LOOKUP_MAX>
    --use-v2-block-manager
    1. Replace <NUM_SPECULATIVE_TOKENS> and <NGRAM_PROMPT_LOOKUP_MAX> with your own values.

      Note

      Inferencing throughput varies depending on the model used for speculating with n-grams.

  4. To configure the vLLM model-serving runtime for speculative decoding with a draft model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:

    --port=8080
    --served-model-name={{.Name}}
    --distributed-executor-backend=mp
    --model=/mnt/models/<path_to_original_model>
    --speculative-model=/mnt/models/<path_to_speculative_model>
    --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
    --use-v2-block-manager
    1. Replace <path_to_speculative_model> and <path_to_original_model> with the paths to the speculative model and original model on your S3-compatible object storage.
    2. Replace <NUM_SPECULATIVE_TOKENS> with your own value.
  5. To configure the vLLM model-serving runtime for multi-modal inferencing, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:

    --trust-remote-code
    Note

    Only use the --trust-remote-code argument with models from trusted sources.

  6. Click Deploy.
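The argument sets in steps 3 and 4 can also be assembled programmatically before you paste them into the Additional serving runtime arguments field. The following Python sketch is illustrative only: the helper names `ngram_args` and `draft_model_args` are hypothetical, and the sample values (5 speculative tokens, a lookup window of 4, and the folder names) are assumptions rather than recommendations. The flags are copied from the steps above.

```python
def ngram_args(num_speculative_tokens: int, ngram_prompt_lookup_max: int) -> list[str]:
    """Arguments for speculative decoding by matching n-grams in the prompt (step 3)."""
    if num_speculative_tokens < 1 or ngram_prompt_lookup_max < 1:
        raise ValueError("both values must be positive integers")
    return [
        "--speculative-model=[ngram]",
        f"--num-speculative-tokens={num_speculative_tokens}",
        f"--ngram-prompt-lookup-max={ngram_prompt_lookup_max}",
        "--use-v2-block-manager",
    ]


def draft_model_args(original_path: str, speculative_path: str,
                     num_speculative_tokens: int) -> list[str]:
    """Arguments for speculative decoding with a draft model (step 4).

    Both paths are relative to /mnt/models, because the original model and
    the speculative model are stored in the same folder.
    """
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be a positive integer")
    return [
        "--port=8080",
        "--served-model-name={{.Name}}",
        "--distributed-executor-backend=mp",
        f"--model=/mnt/models/{original_path.strip('/')}",
        f"--speculative-model=/mnt/models/{speculative_path.strip('/')}",
        f"--num-speculative-tokens={num_speculative_tokens}",
        "--use-v2-block-manager",
    ]


if __name__ == "__main__":
    # Sample values only; choose values that suit your model and workload.
    print("\n".join(ngram_args(5, 4)))
    print("\n".join(draft_model_args("original-model", "draft-model", 5)))
```

Each returned string corresponds to one line in the arguments field.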

Verification

  • If you have configured the vLLM model-serving runtime for speculative decoding, use the following example command to verify API requests to your deployed model:

    curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <token>"
  • If you have configured the vLLM model-serving runtime for multi-modal inferencing, use the following example command to verify API requests to the vision-language model (VLM) that you have deployed:

    curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <token>" \
    -d '{"model":"<model_name>",
         "messages":
            [{"role":"<role>",
              "content":
                 [{"type":"text", "text":"<text>"
                  },
                  {"type":"image_url", "image_url":"<image_url_link>"
                  }
                 ]
             }
            ]
        }'
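If you prefer to script the verification request, the same call can be prepared with the Python standard library. This is a minimal sketch, not a supported client: the function name `build_vlm_request` is hypothetical, the default role of `user` is an assumption, and the JSON body mirrors the structure of the curl example above. The sketch only builds the request; it does not send it.

```python
import json
import urllib.request


def build_vlm_request(endpoint_url: str, token: str, model_name: str,
                      text: str, image_url: str,
                      role: str = "user") -> urllib.request.Request:
    """Build a POST request for /v1/chat/completions with text and image content."""
    payload = {
        "model": model_name,
        "messages": [
            {
                "role": role,
                "content": [
                    {"type": "text", "text": text},
                    {"type": "image_url", "image_url": image_url},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=f"{endpoint_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

To send the request, pass the returned object to `urllib.request.urlopen` from a host that can reach your inference endpoint.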

2.3. Adding a custom model-serving runtime

A model-serving runtime adds support for a specified set of model frameworks and the model formats supported by those frameworks. You can use the preinstalled runtimes that are included with OpenShift AI. You can also add your own custom runtimes if the default runtimes do not meet your needs.

As an administrator, you can use the OpenShift AI interface to add and enable a custom model-serving runtime. You can then choose the custom runtime when you deploy a model on the model serving platform.

Note

Red Hat does not provide support for custom runtimes. You are responsible for ensuring that you are licensed to use any custom runtimes that you add, and for correctly configuring and maintaining them.

Prerequisites

  • You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
  • You have built your custom runtime and added the image to a container image repository such as Quay.

Procedure

  1. From the OpenShift AI dashboard, click Settings → Model resources and operations → Serving runtimes.

    The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.

  2. To add a custom runtime, choose one of the following options:

    • To start with an existing runtime (for example, vLLM NVIDIA GPU ServingRuntime for KServe), click the action menu (⋮) next to the existing runtime and then click Duplicate.
    • To add a new custom runtime, click Add serving runtime.
  3. In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
  4. In the Select the API protocol this runtime supports list, select REST or gRPC.
  5. Optional: If you started a new runtime (rather than duplicating an existing one), add your code by choosing one of the following options:

    • Upload a YAML file

      1. Click Upload files.
      2. In the file browser, select a YAML file on your computer.

        The embedded YAML editor opens and shows the contents of the file that you uploaded.

    • Enter YAML code directly in the editor

      1. Click Start from scratch.
      2. Enter or paste YAML code directly in the embedded editor.
    Note

    In many cases, creating a custom runtime will require adding new or custom parameters to the env section of the ServingRuntime specification.

  6. Click Add.

    The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the custom runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.

  7. Optional: To edit your custom runtime, click the action menu (⋮) and select Edit.

Verification

  • The custom model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.
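As noted in the procedure, custom runtimes often need extra entries in the env section of the ServingRuntime specification. The following Python sketch shows one way to set such an entry on a specification that has already been loaded into a dictionary, for example from the YAML that you paste into the editor. The helper name `with_env_var` and the variable name `MY_SETTING` are hypothetical; the sketch assumes that the target container is named `kserve-container`, as in the runtime examples in this chapter.

```python
import copy


def with_env_var(runtime: dict, name: str, value: str,
                 container_name: str = "kserve-container") -> dict:
    """Return a copy of a ServingRuntime dict with one env entry added or updated."""
    updated = copy.deepcopy(runtime)  # leave the caller's dictionary untouched
    for container in updated["spec"]["containers"]:
        if container.get("name") != container_name:
            continue
        env = container.setdefault("env", [])
        for entry in env:
            if entry.get("name") == name:
                entry["value"] = value  # overwrite an existing entry
                break
        else:
            env.append({"name": name, "value": value})
        return updated
    raise KeyError(f"no container named {container_name!r}")


if __name__ == "__main__":
    runtime = {"spec": {"containers": [{"name": "kserve-container"}]}}
    # MY_SETTING is a placeholder; use the variables that your runtime expects.
    print(with_env_var(runtime, "MY_SETTING", "1"))
```

Environment variable values are strings in the specification, so numeric settings are passed as quoted values.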

2.4. Adding a tested and verified runtime

In addition to preinstalled and custom model-serving runtimes, you can also use Red Hat tested and verified model-serving runtimes to support your requirements. For more information about Red Hat tested and verified runtimes, see Tested and verified runtimes for Red Hat OpenShift AI.

You can use the Red Hat OpenShift AI dashboard to add and enable tested and verified runtimes for the model serving platform. You can then choose the runtime when you deploy a model on the model serving platform.

Prerequisites

  • You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.

Procedure

  1. From the OpenShift AI dashboard, click Settings → Model resources and operations → Serving runtimes.

    The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.

  2. Click Add serving runtime.
  3. In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
  4. In the Select the API protocol this runtime supports list, select REST or gRPC.
  5. Click Start from scratch.
  6. Follow these steps to add the IBM Power Accelerated for NVIDIA Triton Inference Server runtime:

    1. If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: triton-ppc64le-runtime
        annotations:
          openshift.io/display-name: Triton Server ServingRuntime for KServe(ppc64le)
      spec:
        supportedModelFormats:
          - name: FIL
            version: "1"
            autoSelect: true
          - name: python
            version: "1"
            autoSelect: true
          - name: onnx
            version: "1"
            autoSelect: true
          - name: pytorch
            version: "1"
            autoSelect: true
        multiModel: false
        containers:
          - command:
              - tritonserver
              - --model-repository=/mnt/models
            name: kserve-container
            image: quay.io/powercloud/tritonserver:latest
            resources:
              requests:
                cpu: 2
                memory: 8Gi
              limits:
                cpu: 2
                memory: 8Gi
            ports:
              - containerPort: 8000
  7. Follow these steps to add the IBM Z Accelerated for NVIDIA Triton Inference Server runtime:

    1. If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: ibmz-triton-rest
        labels:
          opendatahub.io/dashboard: "true"
      spec:
        containers:
          - name: kserve-container
            command:
              - /bin/sh
              - -c
            args:
              - /opt/tritonserver/bin/tritonserver --model-repository=/mnt/models --http-port=8000 --grpc-port=8001 --metrics-port=8002
            image: icr.io/ibmz/ibmz-accelerated-for-nvidia-triton-inference-server:<version>
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop:
                  - ALL
              runAsNonRoot: true
              seccompProfile:
                type: RuntimeDefault
            resources:
              limits:
                cpu: "2"
                memory: 4Gi
              requests:
                cpu: "2"
                memory: 4Gi
            ports:
              - containerPort: 8000
                protocol: TCP
        protocolVersions:
          - v2
          - grpc-v2
        supportedModelFormats:
          - name: onnx-mlir
            version: "1"
            autoSelect: true
          - name: snapml
            version: "1"
            autoSelect: true
          - name: pytorch
            version: "1"
            autoSelect: true
    2. If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: ibmz-triton-grpc
        labels:
          opendatahub.io/dashboard: "true"
      spec:
        containers:
          - name: kserve-container
            command:
              - /bin/sh
              - -c
            args:
              - /opt/tritonserver/bin/tritonserver --model-repository=/mnt/models --grpc-port=8001 --http-port=8000 --metrics-port=8002
            image: icr.io/ibmz/ibmz-accelerated-for-nvidia-triton-inference-server:<version>
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop:
                  - ALL
              runAsNonRoot: true
              seccompProfile:
                type: RuntimeDefault
            resources:
              limits:
                cpu: "2"
                memory: 4Gi
              requests:
                cpu: "2"
                memory: 4Gi
            ports:
              - containerPort: 8001
                name: grpc
                protocol: TCP
            volumeMounts:
              - mountPath: /dev/shm
                name: shm
        protocolVersions:
          - v2
          - grpc-v2
        supportedModelFormats:
          - name: onnx-mlir
            version: "1"
            autoSelect: true
          - name: snapml
            version: "1"
            autoSelect: true
          - name: pytorch
            version: "1"
            autoSelect: true
        volumes:
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 2Gi
  8. Follow these steps to add the NVIDIA Triton Inference Server runtime:

    1. If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: triton-kserve-rest
        labels:
          opendatahub.io/dashboard: "true"
      spec:
        annotations:
          prometheus.kserve.io/path: /metrics
          prometheus.kserve.io/port: "8002"
        containers:
          - args:
              - tritonserver
              - --model-store=/mnt/models
              - --grpc-port=9000
              - --http-port=8080
              - --allow-grpc=true
              - --allow-http=true
            image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
            name: kserve-container
            resources:
              limits:
                cpu: "1"
                memory: 2Gi
              requests:
                cpu: "1"
                memory: 2Gi
            ports:
              - containerPort: 8080
                protocol: TCP
        protocolVersions:
          - v2
          - grpc-v2
        supportedModelFormats:
          - autoSelect: true
            name: tensorrt
            version: "8"
          - autoSelect: true
            name: tensorflow
            version: "1"
          - autoSelect: true
            name: tensorflow
            version: "2"
          - autoSelect: true
            name: onnx
            version: "1"
          - name: pytorch
            version: "1"
          - autoSelect: true
            name: triton
            version: "2"
          - autoSelect: true
            name: xgboost
            version: "1"
          - autoSelect: true
            name: python
            version: "1"
    2. If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: triton-kserve-grpc
        labels:
          opendatahub.io/dashboard: "true"
      spec:
        annotations:
          prometheus.kserve.io/path: /metrics
          prometheus.kserve.io/port: "8002"
        containers:
          - args:
              - tritonserver
              - --model-store=/mnt/models
              - --grpc-port=9000
              - --http-port=8080
              - --allow-grpc=true
              - --allow-http=true
            image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
            name: kserve-container
            ports:
              - containerPort: 9000
                name: h2c
                protocol: TCP
            volumeMounts:
              - mountPath: /dev/shm
                name: shm
            resources:
              limits:
                cpu: "1"
                memory: 2Gi
              requests:
                cpu: "1"
                memory: 2Gi
        protocolVersions:
          - v2
          - grpc-v2
        supportedModelFormats:
          - autoSelect: true
            name: tensorrt
            version: "8"
          - autoSelect: true
            name: tensorflow
            version: "1"
          - autoSelect: true
            name: tensorflow
            version: "2"
          - autoSelect: true
            name: onnx
            version: "1"
          - name: pytorch
            version: "1"
          - autoSelect: true
            name: triton
            version: "2"
          - autoSelect: true
            name: xgboost
            version: "1"
          - autoSelect: true
            name: python
            version: "1"
        volumes:
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 2Gi
  9. In the metadata.name field, make sure that the name of the runtime that you are adding does not match the name of a runtime that you have already added.
  10. Optional: To use a custom display name for the runtime that you are adding, add a metadata.annotations.openshift.io/display-name field and specify a value, as shown in the following example:

    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      name: kserve-triton
      annotations:
        openshift.io/display-name: Triton ServingRuntime
    Note

    If you do not configure a custom display name for your runtime, OpenShift AI shows the value of the metadata.name field.

  11. Click Create.

    The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.

  12. Optional: To edit the runtime, click the action menu (⋮) and select Edit.

Verification

  • The model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.
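The naming rules in steps 9 and 10 can be checked before you click Create. The following Python sketch is a hypothetical helper, not part of OpenShift AI: it assumes that each runtime is available as a dictionary with the same layout as the YAML in this section, and the function names `display_name` and `name_is_unique` are illustrative.

```python
def display_name(runtime: dict) -> str:
    """Return the custom display name, falling back to metadata.name."""
    metadata = runtime["metadata"]
    annotations = metadata.get("annotations") or {}
    return annotations.get("openshift.io/display-name") or metadata["name"]


def name_is_unique(candidate: dict, existing: list[dict]) -> bool:
    """Check that the candidate's metadata.name is not used by an existing runtime."""
    taken = {runtime["metadata"]["name"] for runtime in existing}
    return candidate["metadata"]["name"] not in taken


if __name__ == "__main__":
    existing = [{"metadata": {"name": "triton-kserve-rest"}}]
    new_runtime = {
        "metadata": {
            "name": "kserve-triton",
            "annotations": {"openshift.io/display-name": "Triton ServingRuntime"},
        }
    }
    print(display_name(new_runtime), name_is_unique(new_runtime, existing))
```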