
Chapter 2. Configuring model servers


You configure model servers by using model-serving runtimes, which add support for a specified set of model frameworks and the model formats that they support.

2.1. Enabling the model serving platform

When you have installed KServe, you can use the Red Hat OpenShift AI dashboard to enable the model serving platform. You can also use the dashboard to enable model-serving runtimes for the platform.

Prerequisites

  • You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
  • You have installed KServe.
  • The spec.dashboardConfig.disableKServe dashboard configuration option is set to false (the default).

    For more information about setting dashboard configuration options, see Customizing the dashboard.
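The dashboard option named in the prerequisites can also be checked programmatically. The following is a minimal sketch, assuming the dashboard configuration has already been loaded into a Python dictionary (for example, from an exported YAML or JSON file). The function name `kserve_enabled` and the sample dictionaries are hypothetical and are not part of OpenShift AI.

```python
def kserve_enabled(dashboard_config: dict) -> bool:
    """Return True when spec.dashboardConfig.disableKServe is false or unset."""
    options = dashboard_config.get("spec", {}).get("dashboardConfig", {})
    # The option defaults to false, so a missing key means KServe is not disabled.
    return not options.get("disableKServe", False)


# Hypothetical sample configurations, for illustration only.
default_config = {"spec": {"dashboardConfig": {}}}
disabled_config = {"spec": {"dashboardConfig": {"disableKServe": True}}}

print(kserve_enabled(default_config))
print(kserve_enabled(disabled_config))
```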

Procedure

  1. Enable the model serving platform as follows:

    1. In the left menu, click Settings → Cluster settings → General settings.
    2. Locate the Model serving platforms section.
    3. To enable the model serving platform for projects, select the Model serving platform checkbox.
    4. Click Save changes.
  2. Enable preinstalled runtimes for the model serving platform as follows:

    1. In the left menu of the OpenShift AI dashboard, click Settings → Model resources and operations → Serving runtimes.

      The Serving runtimes page shows preinstalled runtimes and any custom runtimes that you have added.

      For more information about preinstalled runtimes, see Supported runtimes.

    2. Set the runtime that you want to use to Enabled.

      The model serving platform is now available for model deployments.

2.2. Enabling speculative decoding and multi-modal inferencing

You can configure the vLLM NVIDIA GPU ServingRuntime for KServe runtime to use speculative decoding, a parallel processing technique to optimize inferencing time for large language models (LLMs).

You can also configure the runtime to support inferencing for vision-language models (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.

The following procedure describes customizing the vLLM NVIDIA GPU ServingRuntime for KServe runtime for speculative decoding and multi-modal inferencing.

Prerequisites

  • You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
  • If you are using the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.

Procedure

  1. Follow the steps to deploy a model as described in Deploying models on the model serving platform.
  2. In the Serving runtime field, select the vLLM NVIDIA GPU ServingRuntime for KServe runtime.
  3. To configure the vLLM model-serving runtime for speculative decoding by matching n-grams in the prompt, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:

    --speculative-model=[ngram]
    --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
    --ngram-prompt-lookup-max=<NGRAM_PROMPT_LOOKUP_MAX>
    --use-v2-block-manager
    1. Replace <NUM_SPECULATIVE_TOKENS> and <NGRAM_PROMPT_LOOKUP_MAX> with your own values.

      Note

      Inferencing throughput varies depending on the model used for speculating with n-grams.

  4. To configure the vLLM model-serving runtime for speculative decoding with a draft model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:

    --port=8080
    --served-model-name={{.Name}}
    --distributed-executor-backend=mp
    --model=/mnt/models/<path_to_original_model>
    --speculative-model=/mnt/models/<path_to_speculative_model>
    --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
    --use-v2-block-manager
    1. Replace <path_to_speculative_model> and <path_to_original_model> with the paths to the speculative model and original model on your S3-compatible object storage.
    2. Replace <NUM_SPECULATIVE_TOKENS> with your own value.
  5. To configure the vLLM model-serving runtime for multi-modal inferencing, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:

    --trust-remote-code
    Note

    Only use the --trust-remote-code argument with models from trusted sources.

  6. Click Deploy.
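The three argument sets in the procedure above can be assembled programmatically before you paste them into the Additional serving runtime arguments field. The following is a minimal sketch that uses only the flags listed in this procedure. The function names, the example token counts, and the example model paths are hypothetical.

```python
from typing import List


def ngram_speculative_args(num_speculative_tokens: int,
                           ngram_prompt_lookup_max: int) -> List[str]:
    """Arguments for speculative decoding by matching n-grams in the prompt."""
    return [
        "--speculative-model=[ngram]",
        f"--num-speculative-tokens={num_speculative_tokens}",
        f"--ngram-prompt-lookup-max={ngram_prompt_lookup_max}",
        "--use-v2-block-manager",
    ]


def draft_model_speculative_args(original_model_path: str,
                                 speculative_model_path: str,
                                 num_speculative_tokens: int) -> List[str]:
    """Arguments for speculative decoding with a draft model."""
    return [
        "--port=8080",
        # Plain string, not an f-string: {{.Name}} is a template placeholder
        # that must be passed through unchanged.
        "--served-model-name={{.Name}}",
        "--distributed-executor-backend=mp",
        f"--model=/mnt/models/{original_model_path}",
        f"--speculative-model=/mnt/models/{speculative_model_path}",
        f"--num-speculative-tokens={num_speculative_tokens}",
        "--use-v2-block-manager",
    ]


def multimodal_args() -> List[str]:
    """Arguments for multi-modal inferencing; use only with trusted models."""
    return ["--trust-remote-code"]


# Illustrative values only; choose values suited to your model.
print("\n".join(ngram_speculative_args(5, 4)))
print("\n".join(draft_model_speculative_args("original", "draft", 5)))
```

Each function returns one argument per list element, which matches the one-argument-per-line layout shown in the procedure.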

Verification

  • If you have configured the vLLM model-serving runtime for speculative decoding, use the following example command to verify API requests to your deployed model:

    curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <token>"
  • If you have configured the vLLM model-serving runtime for multi-modal inferencing, use the following example command to verify API requests to the vision-language model (VLM) that you have deployed:

    curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <token>" \
    -d '{"model":"<model_name>",
         "messages":
            [{"role":"<role>",
              "content":
                 [{"type":"text", "text":"<text>"
                  },
                  {"type":"image_url", "image_url":"<image_url_link>"
                  }
                 ]
             }
            ]
        }'
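The multi-modal verification request can also be built from a script. The following is a minimal sketch that uses only the Python standard library and mirrors the payload shape of the curl example in this section. The function name `build_vlm_request` and all example values are hypothetical. The request is constructed but not sent, so the sketch runs without network access.

```python
import json
import urllib.request


def build_vlm_request(endpoint_url: str, token: str, model_name: str,
                      text: str, image_url: str,
                      role: str = "user") -> urllib.request.Request:
    """Build a POST request to /v1/chat/completions with text and image content."""
    payload = {
        "model": model_name,
        "messages": [
            {
                "role": role,
                "content": [
                    {"type": "text", "text": text},
                    # Same shape as the curl example in this section.
                    {"type": "image_url", "image_url": image_url},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=f"{endpoint_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Hypothetical values, for illustration only.
request = build_vlm_request(
    endpoint_url="https://example.com",
    token="example-token",
    model_name="example-vlm",
    text="Describe this image.",
    image_url="https://example.com/image.png",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To send the request: urllib.request.urlopen(request)
```

The OpenAI Chat Completions API represents the image as a nested object of the form `{"url": "<image_url_link>"}`. If your endpoint rejects the plain string form, adjusting the `image_url` entry to that shape is a reasonable first thing to try.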

2.3. Adding a custom model-serving runtime

A model-serving runtime adds support for a specified set of model frameworks and the model formats supported by those frameworks. You can use the preinstalled runtimes that are included with OpenShift AI. You can also add your own custom runtimes if the default runtimes do not meet your needs.

As an administrator, you can use the OpenShift AI interface to add and enable a custom model-serving runtime. You can then choose the custom runtime when you deploy a model on the model serving platform.

Note

Red Hat does not provide support for custom runtimes. You are responsible for ensuring that you are licensed to use any custom runtimes that you add, and for correctly configuring and maintaining them.

Prerequisites

  • You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
  • You have built your custom runtime and added the image to a container image repository such as Quay.

Procedure

  1. From the OpenShift AI dashboard, click Settings → Model resources and operations → Serving runtimes.

    The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.

  2. To add a custom runtime, choose one of the following options:

    • To start with an existing runtime (for example, vLLM NVIDIA GPU ServingRuntime for KServe), click the action menu (⋮) next to the existing runtime and then click Duplicate.
    • To add a new custom runtime, click Add serving runtime.
  3. In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
  4. In the Select the API protocol this runtime supports list, select REST or gRPC.
  5. Optional: If you started a new runtime (rather than duplicating an existing one), add your code by choosing one of the following options:

    • Upload a YAML file

      1. Click Upload files.
      2. In the file browser, select a YAML file on your computer.

        The embedded YAML editor opens and shows the contents of the file that you uploaded.

    • Enter YAML code directly in the editor

      1. Click Start from scratch.
      2. Enter or paste YAML code directly in the embedded editor.
    Note

    In many cases, creating a custom runtime will require adding new or custom parameters to the env section of the ServingRuntime specification.

  6. Click Add.

    The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the custom runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.

  7. Optional: To edit your custom runtime, click the action menu (⋮) and select Edit.

Verification

  • The custom model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.
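As the note in the procedure mentions, custom runtimes often need extra parameters in the env section of the ServingRuntime specification. The following is a minimal sketch of how such a specification might be generated, using only field names that appear in the ServingRuntime examples in this chapter. The function name, the runtime name, the image reference, the model format, and the environment variable are all hypothetical. The output is JSON, which is also valid YAML 1.2.

```python
import json
from typing import Dict


def build_serving_runtime(name: str, image: str, model_format: str,
                          env: Dict[str, str]) -> dict:
    """Build a minimal ServingRuntime specification with custom env entries."""
    return {
        "apiVersion": "serving.kserve.io/v1alpha1",
        "kind": "ServingRuntime",
        "metadata": {"name": name},
        "spec": {
            "supportedModelFormats": [
                {"name": model_format, "version": "1", "autoSelect": True}
            ],
            "containers": [
                {
                    "name": "kserve-container",
                    "image": image,
                    # Custom parameters go in the env section as name/value pairs.
                    "env": [{"name": k, "value": v} for k, v in env.items()],
                }
            ],
        },
    }


# Hypothetical values, for illustration only.
runtime = build_serving_runtime(
    name="example-custom-runtime",
    image="quay.io/example/custom-runtime:1.0",
    model_format="onnx",
    env={"EXAMPLE_SETTING": "value"},
)
print(json.dumps(runtime, indent=2))
```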

2.4. Adding a tested and verified runtime

In addition to preinstalled and custom model-serving runtimes, you can also use Red Hat tested and verified model-serving runtimes to support your requirements. For more information about Red Hat tested and verified runtimes, see Tested and verified runtimes for Red Hat OpenShift AI.

You can use the Red Hat OpenShift AI dashboard to add and enable tested and verified runtimes for the model serving platform. You can then choose the runtime when you deploy a model on the model serving platform.

Prerequisites

  • You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.

Procedure

  1. From the OpenShift AI dashboard, click Settings → Model resources and operations → Serving runtimes.

    The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.

  2. Click Add serving runtime.
  3. In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
  4. In the Select the API protocol this runtime supports list, select REST or gRPC.
  5. Click Start from scratch.
  6. To add the IBM Power Accelerated for NVIDIA Triton Inference Server runtime, follow these steps:

    1. If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: triton-ppc64le-runtime
        annotations:
          openshift.io/display-name: Triton Server ServingRuntime for KServe(ppc64le)
      spec:
        supportedModelFormats:
          - name: FIL
            version: "1"
            autoSelect: true
          - name: python
            version: "1"
            autoSelect: true
          - name: onnx
            version: "1"
            autoSelect: true
          - name: pytorch
            version: "1"
            autoSelect: true
        multiModel: false
        containers:
          - command:
              - tritonserver
              - --model-repository=/mnt/models
            name: kserve-container
            image: quay.io/powercloud/tritonserver:latest
            resources:
              requests:
                cpu: 2
                memory: 8Gi
              limits:
                cpu: 2
                memory: 8Gi
            ports:
              - containerPort: 8000
  7. To add the IBM Z Accelerated for NVIDIA Triton Inference Server runtime, follow these steps:

    1. If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: ibmz-triton-rest
        labels:
          opendatahub.io/dashboard: "true"
      spec:
        containers:
          - name: kserve-container
            command:
              - /bin/sh
              - -c
            args:
              - /opt/tritonserver/bin/tritonserver --model-repository=/mnt/models --http-port=8000 --grpc-port=8001 --metrics-port=8002
            image: icr.io/ibmz/ibmz-accelerated-for-nvidia-triton-inference-server:<version>
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop:
                  - ALL
              runAsNonRoot: true
              seccompProfile:
                type: RuntimeDefault
            resources:
              limits:
                cpu: "2"
                memory: 4Gi
              requests:
                cpu: "2"
                memory: 4Gi
            ports:
              - containerPort: 8000
                protocol: TCP
        protocolVersions:
          - v2
          - grpc-v2
        supportedModelFormats:
          - name: onnx-mlir
            version: "1"
            autoSelect: true
          - name: snapml
            version: "1"
            autoSelect: true
          - name: pytorch
            version: "1"
            autoSelect: true
    2. If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: ibmz-triton-grpc
        labels:
          opendatahub.io/dashboard: "true"
      spec:
        containers:
          - name: kserve-container
            command:
              - /bin/sh
              - -c
            args:
              - /opt/tritonserver/bin/tritonserver --model-repository=/mnt/models --grpc-port=8001 --http-port=8000 --metrics-port=8002
            image: icr.io/ibmz/ibmz-accelerated-for-nvidia-triton-inference-server:<version>
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop:
                  - ALL
              runAsNonRoot: true
              seccompProfile:
                type: RuntimeDefault
            resources:
              limits:
                cpu: "2"
                memory: 4Gi
              requests:
                cpu: "2"
                memory: 4Gi
            ports:
              - containerPort: 8001
                name: grpc
                protocol: TCP
            volumeMounts:
              - mountPath: /dev/shm
                name: shm
        protocolVersions:
          - v2
          - grpc-v2
        supportedModelFormats:
          - name: onnx-mlir
            version: "1"
            autoSelect: true
          - name: snapml
            version: "1"
            autoSelect: true
          - name: pytorch
            version: "1"
            autoSelect: true
        volumes:
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 2Gi
  8. To add the NVIDIA Triton Inference Server runtime, follow these steps:

    1. If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: triton-kserve-rest
        labels:
          opendatahub.io/dashboard: "true"
      spec:
        annotations:
          prometheus.kserve.io/path: /metrics
          prometheus.kserve.io/port: "8002"
        containers:
          - args:
              - tritonserver
              - --model-store=/mnt/models
              - --grpc-port=9000
              - --http-port=8080
              - --allow-grpc=true
              - --allow-http=true
            image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
            name: kserve-container
            resources:
              limits:
                cpu: "1"
                memory: 2Gi
              requests:
                cpu: "1"
                memory: 2Gi
            ports:
              - containerPort: 8080
                protocol: TCP
        protocolVersions:
          - v2
          - grpc-v2
        supportedModelFormats:
          - autoSelect: true
            name: tensorrt
            version: "8"
          - autoSelect: true
            name: tensorflow
            version: "1"
          - autoSelect: true
            name: tensorflow
            version: "2"
          - autoSelect: true
            name: onnx
            version: "1"
          - name: pytorch
            version: "1"
          - autoSelect: true
            name: triton
            version: "2"
          - autoSelect: true
            name: xgboost
            version: "1"
          - autoSelect: true
            name: python
            version: "1"
    2. If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: triton-kserve-grpc
        labels:
          opendatahub.io/dashboard: "true"
      spec:
        annotations:
          prometheus.kserve.io/path: /metrics
          prometheus.kserve.io/port: "8002"
        containers:
          - args:
              - tritonserver
              - --model-store=/mnt/models
              - --grpc-port=9000
              - --http-port=8080
              - --allow-grpc=true
              - --allow-http=true
            image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
            name: kserve-container
            ports:
              - containerPort: 9000
                name: h2c
                protocol: TCP
            volumeMounts:
              - mountPath: /dev/shm
                name: shm
            resources:
              limits:
                cpu: "1"
                memory: 2Gi
              requests:
                cpu: "1"
                memory: 2Gi
        protocolVersions:
          - v2
          - grpc-v2
        supportedModelFormats:
          - autoSelect: true
            name: tensorrt
            version: "8"
          - autoSelect: true
            name: tensorflow
            version: "1"
          - autoSelect: true
            name: tensorflow
            version: "2"
          - autoSelect: true
            name: onnx
            version: "1"
          - name: pytorch
            version: "1"
          - autoSelect: true
            name: triton
            version: "2"
          - autoSelect: true
            name: xgboost
            version: "1"
          - autoSelect: true
            name: python
            version: "1"
        volumes:
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 2Gi
  9. To add the Seldon MLServer runtime, follow these steps:

    1. If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: mlserver-kserve-rest
        labels:
          opendatahub.io/dashboard: "true"
      spec:
        annotations:
          openshift.io/display-name: Seldon MLServer
          prometheus.kserve.io/port: "8080"
          prometheus.kserve.io/path: /metrics
        containers:
          - name: kserve-container
            image: 'docker.io/seldonio/mlserver@sha256:07890828601515d48c0fb73842aaf197cbcf245a5c855c789e890282b15ce390'
            env:
              - name: MLSERVER_HTTP_PORT
                value: "8080"
              - name: MLSERVER_GRPC_PORT
                value: "9000"
              - name: MODELS_DIR
                value: /mnt/models
            resources:
              requests:
                cpu: "1"
                memory: 2Gi
              limits:
                cpu: "1"
                memory: 2Gi
            ports:
              - containerPort: 8080
                protocol: TCP
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop:
                  - ALL
              privileged: false
              runAsNonRoot: true
        protocolVersions:
          - v2
        multiModel: false
        supportedModelFormats:
          - name: sklearn
            version: "0"
            autoSelect: true
            priority: 2
          - name: sklearn
            version: "1"
            autoSelect: true
            priority: 2
          - name: xgboost
            version: "1"
            autoSelect: true
            priority: 2
          - name: xgboost
            version: "2"
            autoSelect: true
            priority: 2
          - name: lightgbm
            version: "3"
            autoSelect: true
            priority: 2
          - name: lightgbm
            version: "4"
            autoSelect: true
            priority: 2
          - name: mlflow
            version: "1"
            autoSelect: true
            priority: 1
          - name: mlflow
            version: "2"
            autoSelect: true
            priority: 1
          - name: catboost
            version: "1"
            autoSelect: true
            priority: 1
          - name: huggingface
            version: "1"
            autoSelect: true
            priority: 1
    2. If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: mlserver-kserve-grpc
        labels:
          opendatahub.io/dashboard: "true"
      spec:
        annotations:
          openshift.io/display-name: Seldon MLServer
          prometheus.kserve.io/port: "8080"
          prometheus.kserve.io/path: /metrics
        containers:
          - name: kserve-container
            image: 'docker.io/seldonio/mlserver@sha256:07890828601515d48c0fb73842aaf197cbcf245a5c855c789e890282b15ce390'
            env:
              - name: MLSERVER_HTTP_PORT
                value: "8080"
              - name: MLSERVER_GRPC_PORT
                value: "9000"
              - name: MODELS_DIR
                value: /mnt/models
            resources:
              requests:
                cpu: "1"
                memory: 2Gi
              limits:
                cpu: "1"
                memory: 2Gi
            ports:
              - containerPort: 9000
                name: h2c
                protocol: TCP
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop:
                  - ALL
              privileged: false
              runAsNonRoot: true
        protocolVersions:
          - v2
        multiModel: false
        supportedModelFormats:
          - name: sklearn
            version: "0"
            autoSelect: true
            priority: 2
          - name: sklearn
            version: "1"
            autoSelect: true
            priority: 2
          - name: xgboost
            version: "1"
            autoSelect: true
            priority: 2
          - name: xgboost
            version: "2"
            autoSelect: true
            priority: 2
          - name: lightgbm
            version: "3"
            autoSelect: true
            priority: 2
          - name: lightgbm
            version: "4"
            autoSelect: true
            priority: 2
          - name: mlflow
            version: "1"
            autoSelect: true
            priority: 1
          - name: mlflow
            version: "2"
            autoSelect: true
            priority: 1
          - name: catboost
            version: "1"
            autoSelect: true
            priority: 1
          - name: huggingface
            version: "1"
            autoSelect: true
            priority: 1
  10. In the metadata.name field, make sure that the name of the runtime that you are adding is unique and does not match the name of a runtime that you have already added.
  11. Optional: To use a custom display name for the runtime that you are adding, add a metadata.annotations.openshift.io/display-name field and specify a value, as shown in the following example:

    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      name: kserve-triton
      annotations:
        openshift.io/display-name: Triton ServingRuntime
    Note

    If you do not configure a custom display name for your runtime, OpenShift AI shows the value of the metadata.name field.

  12. Click Create.

    The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.

  13. Optional: To edit the runtime, click the action menu (⋮) and select Edit.

Verification

  • The model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.
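The naming rules in the procedure above (a unique metadata.name, and a display name that falls back to metadata.name when no annotation is set) can be sketched as follows. The sketch assumes that the runtime specification has been loaded into a Python dictionary. The function names and the list of existing runtime names are hypothetical.

```python
from typing import Iterable

DISPLAY_NAME_ANNOTATION = "openshift.io/display-name"


def name_is_unique(runtime: dict, existing_names: Iterable[str]) -> bool:
    """Return True when metadata.name does not match an existing runtime."""
    return runtime["metadata"]["name"] not in set(existing_names)


def display_label(runtime: dict) -> str:
    """Return the custom display name, or metadata.name when none is set."""
    metadata = runtime["metadata"]
    annotations = metadata.get("annotations") or {}
    return annotations.get(DISPLAY_NAME_ANNOTATION, metadata["name"])


# The runtime from the display-name example in this section.
with_display_name = {
    "apiVersion": "serving.kserve.io/v1alpha1",
    "kind": "ServingRuntime",
    "metadata": {
        "name": "kserve-triton",
        "annotations": {DISPLAY_NAME_ANNOTATION: "Triton ServingRuntime"},
    },
}
without_display_name = {"metadata": {"name": "kserve-triton"}}

# Hypothetical list of runtimes that were added earlier.
existing = ["triton-kserve-rest", "mlserver-kserve-rest"]

print(name_is_unique(with_display_name, existing))
print(display_label(with_display_name))
print(display_label(without_display_name))
```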