Chapter 2. Configuring model servers on the single-model serving platform
On the single-model serving platform, you configure model servers by using model-serving runtimes. A model-serving runtime adds support for a specified set of model frameworks and their associated model formats.
2.1. About the single-model serving platform
For deploying large models such as large language models (LLMs), OpenShift AI includes a single-model serving platform that is based on the KServe component. Because each model is deployed on its own model server, the single-model serving platform helps you to deploy, monitor, scale, and maintain large models that require increased resources.
2.1.1. Components
- KServe: A Kubernetes custom resource definition (CRD) that orchestrates model serving for all types of models. KServe includes model-serving runtimes that implement the loading of given types of model servers. KServe also handles the lifecycle of the deployment object, storage access, and networking setup.
- Red Hat OpenShift Serverless: A cloud-native development model that allows for serverless deployments of models. OpenShift Serverless is based on the open source Knative project.
- Red Hat OpenShift Service Mesh: A service mesh networking layer that manages traffic flows and enforces access policies. OpenShift Service Mesh is based on the open source Istio project.
2.1.2. Installation options
To install the single-model serving platform, you have the following options:
- Automated installation
If you have not already created a ServiceMeshControlPlane or KNativeServing resource on your OpenShift cluster, you can configure the Red Hat OpenShift AI Operator to install KServe and configure its dependencies. For more information about automated installation, see Configuring automated installation of KServe.
- Manual installation
If you have already created a ServiceMeshControlPlane or KNativeServing resource on your OpenShift cluster, you cannot configure the Red Hat OpenShift AI Operator to install KServe and configure its dependencies. In this situation, you must install KServe manually. For more information about manual installation, see Manually installing KServe.
2.1.3. Authorization
You can add Authorino as an authorization provider for the single-model serving platform. Adding an authorization provider allows you to enable token authentication for models that you deploy on the platform, which ensures that only authorized parties can make inference requests to the models.
To add Authorino as an authorization provider on the single-model serving platform, you have the following options:
- If automated installation of the single-model serving platform is possible on your cluster, you can include Authorino as part of the automated installation process.
- If you need to manually install the single-model serving platform, you must also manually configure Authorino.
For guidance on choosing an installation option for the single-model serving platform, see Installation options.
2.1.4. Monitoring
You can configure monitoring for the single-model serving platform and use Prometheus to scrape metrics for each of the pre-installed model-serving runtimes.
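On OpenShift, scraping metrics from workloads in user projects typically requires user workload monitoring to be enabled first. As a sketch only (the exact monitoring setup for your cluster may differ), the standard way to enable it is through the cluster-monitoring-config ConfigMap:

```yaml
# Sketch: enable user workload monitoring on OpenShift.
# This is the generic OpenShift mechanism; consult the OpenShift AI
# monitoring documentation for the platform-specific configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```

With user workload monitoring enabled, Prometheus can scrape the metrics endpoints that the pre-installed model-serving runtimes expose.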
2.2. Enabling the single-model serving platform
When you have installed KServe, you can use the Red Hat OpenShift AI dashboard to enable the single-model serving platform. You can also use the dashboard to enable model-serving runtimes for the platform.
Prerequisites
- You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
- You have installed KServe.
- The spec.dashboardConfig.disableKServe dashboard configuration option is set to false (the default). For more information about setting dashboard configuration options, see Customizing the dashboard.
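For orientation, the dashboard configuration option named in the prerequisite lives in the OdhDashboardConfig resource. The following is a sketch only; the field path comes from the prerequisite, while the apiVersion and metadata shown are assumptions that may differ on your cluster:

```yaml
# Sketch of the relevant portion of an OdhDashboardConfig resource.
# Only spec.dashboardConfig.disableKServe is taken from the prerequisite above;
# other fields are illustrative.
apiVersion: opendatahub.io/v1alpha
kind: OdhDashboardConfig
metadata:
  name: odh-dashboard-config
spec:
  dashboardConfig:
    disableKServe: false   # false (the default) keeps KServe options visible in the dashboard
```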
Procedure
Enable the single-model serving platform as follows:
- In the left menu, click Settings → Cluster settings.
- Locate the Model serving platforms section.
- To enable the single-model serving platform for projects, select the Single-model serving platform checkbox.
- Select KServe RawDeployment or Knative Serverless deployment mode. For more information about these deployment mode options, see About KServe deployment modes.
- Click Save changes.
Enable preinstalled runtimes for the single-model serving platform as follows:
- In the left menu of the OpenShift AI dashboard, click Settings → Serving runtimes. The Serving runtimes page shows preinstalled runtimes and any custom runtimes that you have added. For more information about preinstalled runtimes, see Supported runtimes.
- Set the runtime that you want to use to Enabled.
The single-model serving platform is now available for model deployments.
2.3. Enabling speculative decoding and multi-modal inferencing
You can configure the vLLM NVIDIA GPU ServingRuntime for KServe runtime to use speculative decoding, a parallel processing technique to optimize inferencing time for large language models (LLMs).
You can also configure the runtime to support inferencing for vision-language models (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.
The following procedure describes customizing the vLLM NVIDIA GPU ServingRuntime for KServe runtime for speculative decoding and multi-modal inferencing.
Prerequisites
- You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
- If you are using the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.
Procedure
- Follow the steps to deploy a model as described in Deploying models on the single-model serving platform.
- In the Serving runtime field, select the vLLM NVIDIA GPU ServingRuntime for KServe runtime.
To configure the vLLM model-serving runtime for speculative decoding by matching n-grams in the prompt, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
--speculative-model=[ngram] --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS> --ngram-prompt-lookup-max=<NGRAM_PROMPT_LOOKUP_MAX> --use-v2-block-manager

Replace <NUM_SPECULATIVE_TOKENS> and <NGRAM_PROMPT_LOOKUP_MAX> with your own values.

Note: Inferencing throughput varies depending on the model used for speculating with n-grams.
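For illustration, a filled-in version of the arguments above might look like the following. The values 5 and 4 are arbitrary example values, not tuning recommendations; suitable values depend on your model and workload:

```
--speculative-model=[ngram] --num-speculative-tokens=5 --ngram-prompt-lookup-max=4 --use-v2-block-manager
```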
To configure the vLLM model-serving runtime for speculative decoding with a draft model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
--port=8080 --served-model-name={{.Name}} --distributed-executor-backend=mp --model=/mnt/models/<path_to_original_model> --speculative-model=/mnt/models/<path_to_speculative_model> --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS> --use-v2-block-manager

- Replace <path_to_original_model> and <path_to_speculative_model> with the paths to the original model and the speculative model in your S3-compatible object storage.
- Replace <NUM_SPECULATIVE_TOKENS> with your own value.
To configure the vLLM model-serving runtime for multi-modal inferencing, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
--trust-remote-code

Note: Only use the --trust-remote-code argument with models from trusted sources.

- Click Deploy.
Verification
If you have configured the vLLM model-serving runtime for speculative decoding, use the following example command to verify API requests to your deployed model:
curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer <token>"

If you have configured the vLLM model-serving runtime for multi-modal inferencing, use the following example command to verify API requests to the vision-language model (VLM) that you have deployed:
curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer <token>" -d '{"model":"<model_name>", "messages": [{"role":"<role>", "content": [{"type":"text", "text":"<text>" }, {"type":"image_url", "image_url":"<image_url_link>" } ] } ] }'
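If you prefer to drive the verification from a script rather than curl, the JSON body from the VLM example above can be built programmatically. The following Python sketch mirrors the payload structure shown in the curl command; the model name, text, and image URL are placeholder values, and sending the request (for example, with an HTTP client and your bearer token) is left out:

```python
import json

def build_vlm_request(model_name: str, text: str, image_url: str) -> str:
    """Build the chat-completions payload used in the curl example above.

    The structure mirrors the documented request body: one user message
    whose content combines a text part and an image_url part.
    """
    payload = {
        "model": model_name,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": text},
                    {"type": "image_url", "image_url": image_url},
                ],
            }
        ],
    }
    return json.dumps(payload)

# Placeholder values for illustration only.
body = build_vlm_request(
    "my-vlm", "Describe this image.", "https://example.com/cat.png"
)
print(body)
```

You would POST this body to https://<inference_endpoint_url>:443/v1/chat/completions with the same Content-Type and Authorization headers as in the curl example.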
2.4. Adding a custom model-serving runtime for the single-model serving platform
A model-serving runtime adds support for a specified set of model frameworks and the model formats supported by those frameworks. You can use the preinstalled runtimes that are included with OpenShift AI. You can also add your own custom runtimes if the default runtimes do not meet your needs.
As an administrator, you can use the OpenShift AI interface to add and enable a custom model-serving runtime. You can then choose the custom runtime when you deploy a model on the single-model serving platform.
Red Hat does not provide support for custom runtimes. You are responsible for ensuring that you are licensed to use any custom runtimes that you add, and for correctly configuring and maintaining them.
Prerequisites
- You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
- You have built your custom runtime and added the image to a container image repository such as Quay.
Procedure
From the OpenShift AI dashboard, click Settings → Serving runtimes. The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
To add a custom runtime, choose one of the following options:
- To start with an existing runtime (for example, vLLM NVIDIA GPU ServingRuntime for KServe), click the action menu (⋮) next to the existing runtime and then click Duplicate.
- To add a new custom runtime, click Add serving runtime.
- In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
- In the Select the API protocol this runtime supports list, select REST or gRPC.
Optional: If you started a new runtime (rather than duplicating an existing one), add your code by choosing one of the following options:
Upload a YAML file
- Click Upload files.
In the file browser, select a YAML file on your computer.
The embedded YAML editor opens and shows the contents of the file that you uploaded.
Enter YAML code directly in the editor
- Click Start from scratch.
- Enter or paste YAML code directly in the embedded editor.
Note: In many cases, creating a custom runtime requires adding new or custom parameters to the env section of the ServingRuntime specification.

- Click Add.
The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the custom runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.
- Optional: To edit your custom runtime, click the action menu (⋮) and select Edit.
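For orientation, a minimal custom ServingRuntime specification might follow the shape below. This is a sketch, not a supported configuration: the name, image, port, model format, and environment variable are all placeholders that you replace with values for your own runtime:

```yaml
# Sketch of a minimal custom ServingRuntime. All concrete values are placeholders.
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: my-custom-runtime            # must not clash with an existing runtime name
  labels:
    opendatahub.io/dashboard: "true"
spec:
  containers:
    - name: kserve-container
      image: quay.io/<your_org>/<your_runtime_image>:<tag>   # placeholder image reference
      ports:
        - containerPort: 8080
          protocol: TCP
      env:
        - name: MODELS_DIR            # custom parameters often go in the env section
          value: /mnt/models
  multiModel: false
  supportedModelFormats:
    - name: <your_model_format>       # placeholder model format
      version: "1"
      autoSelect: true
```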
Verification
- The custom model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.
2.5. Adding a tested and verified model-serving runtime for the single-model serving platform
In addition to preinstalled and custom model-serving runtimes, you can also use Red Hat tested and verified model-serving runtimes to support your requirements. For more information about Red Hat tested and verified runtimes, see Tested and verified runtimes for Red Hat OpenShift AI.
You can use the Red Hat OpenShift AI dashboard to add and enable tested and verified runtimes for the single-model serving platform. You can then choose the runtime when you deploy a model on the single-model serving platform.
Prerequisites
- You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
- If you are deploying the IBM Z Accelerated for NVIDIA Triton Inference Server runtime, you have access to IBM Cloud Container Registry to pull the container image. For more information about obtaining credentials to the IBM Cloud Container Registry, see Downloading the IBM Z Accelerated for NVIDIA Triton Inference Server container image.
- If you are deploying the IBM Power Accelerated Triton Inference Server runtime, you can access the container image from the Triton Inference Server Quay repository.
Procedure
From the OpenShift AI dashboard, click Settings → Serving runtimes. The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
- Click Add serving runtime.
- In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
- In the Select the API protocol this runtime supports list, select REST or gRPC.
- Click Start from scratch.
Follow these steps to add the IBM Power Accelerated for NVIDIA Triton Inference Server runtime:
If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-ppc64le-runtime
  annotations:
    openshift.io/display-name: Triton Server ServingRuntime for KServe(ppc64le)
spec:
  supportedModelFormats:
    - name: FIL
      version: "1"
      autoSelect: true
    - name: python
      version: "1"
      autoSelect: true
    - name: onnx
      version: "1"
      autoSelect: true
    - name: pytorch
      version: "1"
      autoSelect: true
  multiModel: false
  containers:
    - command:
        - tritonserver
        - --model-repository=/mnt/models
      name: kserve-container
      image: quay.io/powercloud/tritonserver:latest
      resources:
        requests:
          cpu: 2
          memory: 8Gi
        limits:
          cpu: 2
          memory: 8Gi
      ports:
        - containerPort: 8000
```
Follow these steps to add the IBM Z Accelerated for NVIDIA Triton Inference Server runtime:
If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: ibmz-triton-rest
  labels:
    opendatahub.io/dashboard: "true"
spec:
  containers:
    - name: kserve-container
      command:
        - /bin/sh
        - -c
      args:
        - /opt/tritonserver/bin/tritonserver --model-repository=/mnt/models --http-port=8000 --grpc-port=8001 --metrics-port=8002
      image: icr.io/ibmz/ibmz-accelerated-for-nvidia-triton-inference-server:<version>
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      resources:
        limits:
          cpu: "2"
          memory: 4Gi
        requests:
          cpu: "2"
          memory: 4Gi
      ports:
        - containerPort: 8000
          protocol: TCP
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - name: onnx-mlir
      version: "1"
      autoSelect: true
    - name: snapml
      version: "1"
      autoSelect: true
    - name: pytorch
      version: "1"
      autoSelect: true
```

If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: ibmz-triton-grpc
  labels:
    opendatahub.io/dashboard: "true"
spec:
  containers:
    - name: kserve-container
      command:
        - /bin/sh
        - -c
      args:
        - /opt/tritonserver/bin/tritonserver --model-repository=/mnt/models --grpc-port=8001 --http-port=8000 --metrics-port=8002
      image: icr.io/ibmz/ibmz-accelerated-for-nvidia-triton-inference-server:<version>
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      resources:
        limits:
          cpu: "2"
          memory: 4Gi
        requests:
          cpu: "2"
          memory: 4Gi
      ports:
        - containerPort: 8001
          name: grpc
          protocol: TCP
      volumeMounts:
        - mountPath: /dev/shm
          name: shm
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - name: onnx-mlir
      version: "1"
      autoSelect: true
    - name: snapml
      version: "1"
      autoSelect: true
    - name: pytorch
      version: "1"
      autoSelect: true
  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 2Gi
```
Follow these steps to add the NVIDIA Triton Inference Server runtime:
If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-kserve-rest
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
    - args:
        - tritonserver
        - --model-store=/mnt/models
        - --grpc-port=9000
        - --http-port=8080
        - --allow-grpc=true
        - --allow-http=true
      image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
      name: kserve-container
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "1"
          memory: 2Gi
      ports:
        - containerPort: 8080
          protocol: TCP
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: tensorrt
      version: "8"
    - autoSelect: true
      name: tensorflow
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "2"
    - autoSelect: true
      name: onnx
      version: "1"
    - name: pytorch
      version: "1"
    - autoSelect: true
      name: triton
      version: "2"
    - autoSelect: true
      name: xgboost
      version: "1"
    - autoSelect: true
      name: python
      version: "1"
```

If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-kserve-grpc
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
    - args:
        - tritonserver
        - --model-store=/mnt/models
        - --grpc-port=9000
        - --http-port=8080
        - --allow-grpc=true
        - --allow-http=true
      image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
      name: kserve-container
      ports:
        - containerPort: 9000
          name: h2c
          protocol: TCP
      volumeMounts:
        - mountPath: /dev/shm
          name: shm
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "1"
          memory: 2Gi
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: tensorrt
      version: "8"
    - autoSelect: true
      name: tensorflow
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "2"
    - autoSelect: true
      name: onnx
      version: "1"
    - name: pytorch
      version: "1"
    - autoSelect: true
      name: triton
      version: "2"
    - autoSelect: true
      name: xgboost
      version: "1"
    - autoSelect: true
      name: python
      version: "1"
  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 2Gi
```
Follow these steps to add the Seldon MLServer runtime:
If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: mlserver-kserve-rest
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    openshift.io/display-name: Seldon MLServer
    prometheus.kserve.io/port: "8080"
    prometheus.kserve.io/path: /metrics
  containers:
    - name: kserve-container
      image: 'docker.io/seldonio/mlserver@sha256:07890828601515d48c0fb73842aaf197cbcf245a5c855c789e890282b15ce390'
      env:
        - name: MLSERVER_HTTP_PORT
          value: "8080"
        - name: MLSERVER_GRPC_PORT
          value: "9000"
        - name: MODELS_DIR
          value: /mnt/models
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "1"
          memory: 2Gi
      ports:
        - containerPort: 8080
          protocol: TCP
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        privileged: false
        runAsNonRoot: true
  protocolVersions:
    - v2
  multiModel: false
  supportedModelFormats:
    - name: sklearn
      version: "0"
      autoSelect: true
      priority: 2
    - name: sklearn
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "2"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "3"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "4"
      autoSelect: true
      priority: 2
    - name: mlflow
      version: "1"
      autoSelect: true
      priority: 1
    - name: mlflow
      version: "2"
      autoSelect: true
      priority: 1
    - name: catboost
      version: "1"
      autoSelect: true
      priority: 1
    - name: huggingface
      version: "1"
      autoSelect: true
      priority: 1
```

If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: mlserver-kserve-grpc
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    openshift.io/display-name: Seldon MLServer
    prometheus.kserve.io/port: "8080"
    prometheus.kserve.io/path: /metrics
  containers:
    - name: kserve-container
      image: 'docker.io/seldonio/mlserver@sha256:07890828601515d48c0fb73842aaf197cbcf245a5c855c789e890282b15ce390'
      env:
        - name: MLSERVER_HTTP_PORT
          value: "8080"
        - name: MLSERVER_GRPC_PORT
          value: "9000"
        - name: MODELS_DIR
          value: /mnt/models
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "1"
          memory: 2Gi
      ports:
        - containerPort: 9000
          name: h2c
          protocol: TCP
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        privileged: false
        runAsNonRoot: true
  protocolVersions:
    - v2
  multiModel: false
  supportedModelFormats:
    - name: sklearn
      version: "0"
      autoSelect: true
      priority: 2
    - name: sklearn
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "2"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "3"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "4"
      autoSelect: true
      priority: 2
    - name: mlflow
      version: "1"
      autoSelect: true
      priority: 1
    - name: mlflow
      version: "2"
      autoSelect: true
      priority: 1
    - name: catboost
      version: "1"
      autoSelect: true
      priority: 1
    - name: huggingface
      version: "1"
      autoSelect: true
      priority: 1
```
- In the metadata.name field, make sure that the value of the runtime you are adding does not match a runtime that you have already added.
- Optional: To use a custom display name for the runtime that you are adding, add a metadata.annotations.openshift.io/display-name field and specify a value, as shown in the following example:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: kserve-triton
  annotations:
    openshift.io/display-name: Triton ServingRuntime
```

Note: If you do not configure a custom display name for your runtime, OpenShift AI shows the value of the metadata.name field.

- Click Create.
The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.
- Optional: To edit the runtime, click the action menu (⋮) and select Edit.
Verification
- The model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.