Chapter 2. Configuring model servers
You configure model servers by using model-serving runtimes, which add support for a specified set of model frameworks and the model formats that they support.
2.1. Enabling the model serving platform
When you have installed KServe, you can use the Red Hat OpenShift AI dashboard to enable the model serving platform. You can also use the dashboard to enable model-serving runtimes for the platform.
Prerequisites
- You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
- You have installed KServe.
- The spec.dashboardConfig.disableKServe dashboard configuration option is set to false (the default). For more information about setting dashboard configuration options, see Customizing the dashboard.
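The disableKServe option above lives in the OdhDashboardConfig custom resource. The following fragment is a hedged sketch of the relevant fields only; the resource name shown is the common default, so verify the name and namespace in your cluster before editing:

```yaml
# Sketch of the relevant fields only; not a complete OdhDashboardConfig resource.
apiVersion: opendatahub.io/v1alpha
kind: OdhDashboardConfig
metadata:
  name: odh-dashboard-config     # common default name; verify in your cluster
spec:
  dashboardConfig:
    disableKServe: false         # false (the default) keeps KServe available in the dashboard
```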
Procedure
Enable the model serving platform as follows:

- In the left menu, click Settings → Cluster settings → General settings.
- Locate the Model serving platforms section.
- To enable the model serving platform for projects, select the Model serving platform checkbox.
- Click Save changes.

Enable preinstalled runtimes for the model serving platform as follows:

- In the left menu of the OpenShift AI dashboard, click Settings → Model resources and operations → Serving runtimes. The Serving runtimes page shows preinstalled runtimes and any custom runtimes that you have added. For more information about preinstalled runtimes, see Supported runtimes.
- Set the runtime that you want to use to Enabled.
The model serving platform is now available for model deployments.
2.2. Enabling speculative decoding and multi-modal inferencing
You can configure the vLLM NVIDIA GPU ServingRuntime for KServe runtime to use speculative decoding, a parallel processing technique to optimize inferencing time for large language models (LLMs).
You can also configure the runtime to support inferencing for vision-language models (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.
The following procedure describes customizing the vLLM NVIDIA GPU ServingRuntime for KServe runtime for speculative decoding and multi-modal inferencing.
Prerequisites
- You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
- If you are using the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.
Procedure
- Follow the steps to deploy a model as described in Deploying models on the model serving platform.
- In the Serving runtime field, select the vLLM NVIDIA GPU ServingRuntime for KServe runtime.
- To configure the vLLM model-serving runtime for speculative decoding by matching n-grams in the prompt, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:

  --speculative-model=[ngram] --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS> --ngram-prompt-lookup-max=<NGRAM_PROMPT_LOOKUP_MAX> --use-v2-block-manager

  Replace <NUM_SPECULATIVE_TOKENS> and <NGRAM_PROMPT_LOOKUP_MAX> with your own values.

  Note: Inferencing throughput varies depending on the model used for speculating with n-grams.
- To configure the vLLM model-serving runtime for speculative decoding with a draft model, add the corresponding arguments under Additional serving runtime arguments in the Configuration parameters section:

  - Replace <path_to_speculative_model> and <path_to_original_model> with the paths to the speculative model and original model on your S3-compatible object storage.
  - Replace <NUM_SPECULATIVE_TOKENS> with your own value.
- To configure the vLLM model-serving runtime for multi-modal inferencing, add the following argument under Additional serving runtime arguments in the Configuration parameters section:

  --trust-remote-code

  Note: Only use the --trust-remote-code argument with models from trusted sources.
- Click Deploy.
Verification
- If you have configured the vLLM model-serving runtime for speculative decoding, use the following example command to verify API requests to your deployed model:

  curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer <token>"

- If you have configured the vLLM model-serving runtime for multi-modal inferencing, send a chat completions request that includes image input to verify API requests to the vision-language model (VLM) that you have deployed.
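A request to a deployed VLM uses the same OpenAI-compatible chat completions API as the command above, with image input passed in the message content. The following command is a hedged template, not a ready-to-run example: the endpoint, token, model name, and image URL are all placeholders that you must replace with your own values.

```shell
# Template only: replace <inference_endpoint_url>, <token>, <model_name>,
# and <image_url> with your own values before running.
curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "model": "<model_name>",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image_url", "image_url": {"url": "<image_url>"}}
        ]
      }
    ]
  }'
```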
2.3. Adding a custom model-serving runtime
A model-serving runtime adds support for a specified set of model frameworks and the model formats supported by those frameworks. You can use the preinstalled runtimes that are included with OpenShift AI. You can also add your own custom runtimes if the default runtimes do not meet your needs.
As an administrator, you can use the OpenShift AI interface to add and enable a custom model-serving runtime. You can then choose the custom runtime when you deploy a model on the model serving platform.
Red Hat does not provide support for custom runtimes. You are responsible for ensuring that you are licensed to use any custom runtimes that you add, and for correctly configuring and maintaining them.
Prerequisites
- You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
- You have built your custom runtime and added the image to a container image repository such as Quay.
Procedure
From the OpenShift AI dashboard, click Settings → Model resources and operations → Serving runtimes. The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
To add a custom runtime, choose one of the following options:
- To start with an existing runtime (for example, vLLM NVIDIA GPU ServingRuntime for KServe), click the action menu (⋮) next to the existing runtime and then click Duplicate.
- To add a new custom runtime, click Add serving runtime.
- In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
- In the Select the API protocol this runtime supports list, select REST or gRPC.
Optional: If you started a new runtime (rather than duplicating an existing one), add your code by choosing one of the following options:
Upload a YAML file
- Click Upload files.
In the file browser, select a YAML file on your computer.
The embedded YAML editor opens and shows the contents of the file that you uploaded.
Enter YAML code directly in the editor
- Click Start from scratch.
- Enter or paste YAML code directly in the embedded editor.
Note: In many cases, creating a custom runtime requires adding new or custom parameters to the env section of the ServingRuntime specification.

- Click Add.
The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the custom runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.
- Optional: To edit your custom runtime, click the action menu (⋮) and select Edit.
Verification
- The custom model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.
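To illustrate the note above about custom parameters in the env section, the following is a minimal ServingRuntime sketch. The runtime name, container image, model format, and environment variable are all hypothetical placeholders, not a supported configuration:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: my-custom-runtime                        # hypothetical name
spec:
  supportedModelFormats:
    - name: pytorch                              # formats your runtime actually supports
      version: "1"
  containers:
    - name: kserve-container
      image: quay.io/example/my-runtime:latest   # hypothetical image in your registry
      env:
        - name: MY_RUNTIME_LOG_LEVEL             # hypothetical custom parameter
          value: "info"
```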
2.4. Adding a tested and verified runtime
In addition to preinstalled and custom model-serving runtimes, you can also use Red Hat tested and verified model-serving runtimes to support your requirements. For more information about Red Hat tested and verified runtimes, see Tested and verified runtimes for Red Hat OpenShift AI.
You can use the Red Hat OpenShift AI dashboard to add and enable tested and verified runtimes for the model serving platform. You can then choose the runtime when you deploy a model on the model serving platform.
Prerequisites
- You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
- If you are deploying the IBM Z Accelerated for NVIDIA Triton Inference Server runtime, you have access to IBM Cloud Container Registry to pull the container image. For more information about obtaining credentials to the IBM Cloud Container Registry, see Downloading the IBM Z Accelerated for NVIDIA Triton Inference Server container image.
- If you are deploying the IBM Power Accelerated Triton Inference Server runtime, you can access the container image from the Triton Inference Server Quay repository.
Procedure
From the OpenShift AI dashboard, click Settings → Model resources and operations → Serving runtimes. The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
- Click Add serving runtime.
- In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
- In the Select the API protocol this runtime supports list, select REST or gRPC.
- Click Start from scratch.
Follow these steps to add the IBM Power Accelerated for NVIDIA Triton Inference Server runtime:

- If you selected the REST API protocol, enter or paste the YAML code for the runtime directly in the embedded editor.

Follow these steps to add the IBM Z Accelerated for NVIDIA Triton Inference Server runtime:

- If you selected the REST API protocol, enter or paste the YAML code for the runtime directly in the embedded editor.
- If you selected the gRPC API protocol, enter or paste the YAML code for the runtime directly in the embedded editor.

Follow these steps to add the NVIDIA Triton Inference Server runtime:

- If you selected the REST API protocol, enter or paste the YAML code for the runtime directly in the embedded editor.
- If you selected the gRPC API protocol, enter or paste the YAML code for the runtime directly in the embedded editor.

Follow these steps to add the Seldon MLServer runtime:

- If you selected the REST API protocol, enter or paste the YAML code for the runtime directly in the embedded editor.
- If you selected the gRPC API protocol, enter or paste the YAML code for the runtime directly in the embedded editor.
- In the metadata.name field, make sure that the value of the runtime you are adding does not match a runtime that you have already added.
- Optional: To use a custom display name for the runtime that you are adding, add a metadata.annotations.openshift.io/display-name field and specify a value.

  Note: If you do not configure a custom display name for your runtime, OpenShift AI shows the value of the metadata.name field.
- Click Create.
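A hedged illustration of the display-name annotation described above; the runtime name and display name are hypothetical placeholders:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: example-runtime                          # hypothetical; must be unique among your runtimes
  annotations:
    openshift.io/display-name: Example Runtime   # shown in the dashboard instead of metadata.name
```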
The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.
- Optional: To edit the runtime, click the action menu (⋮) and select Edit.
Verification
- The model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.