Chapter 2. Serving models on the single-model serving platform

2.1. About the single-model serving platform
Copy link

For deploying large models such as large language models (LLMs), OpenShift AI includes a single-model serving platform that is based on the KServe component. Because each model is deployed on its own model server, the single-model serving platform helps you to deploy, monitor, scale, and maintain large models that require increased resources.

2.2. Components
Copy link

KServe: A Kubernetes custom resource definition (CRD) that orchestrates model serving for all types of models. KServe includes model-serving runtimes that implement the loading of given types of model servers. KServe also handles the lifecycle of the deployment object, storage access, and networking setup.
Red Hat OpenShift Serverless: A cloud-native development model that allows for serverless deployments of models. OpenShift Serverless is based on the open source Knative project.
Red Hat OpenShift Service Mesh: A service mesh networking layer that manages traffic flows and enforces access policies. OpenShift Service Mesh is based on the open source Istio project.

2.3. Installation options
Copy link

To install the single-model serving platform, you have the following options:

Automated installation

If you have not already created a ServiceMeshControlPlane or KNativeServing resource on your OpenShift cluster, you can configure the Red Hat OpenShift AI Operator to install KServe and configure its dependencies.

For more information about automated installation, see Configuring automated installation of KServe.

Manual installation

If you have already created a ServiceMeshControlPlane or KNativeServing resource on your OpenShift cluster, you cannot configure the Red Hat OpenShift AI Operator to install KServe and configure its dependencies. In this situation, you must install KServe manually.

For more information about manual installation, see Manually installing KServe.

2.4. Authorization
Copy link

You can add Authorino as an authorization provider for the single-model serving platform. Adding an authorization provider allows you to enable token authentication for models that you deploy on the platform, which ensures that only authorized parties can make inference requests to the models.

To add Authorino as an authorization provider on the single-model serving platform, you have the following options:

If automated installation of the single-model serving platform is possible on your cluster, you can include Authorino as part of the automated installation process.
If you need to manually install the single-model serving platform, you must also manually configure Authorino.

For guidance on choosing an installation option for the single-model serving platform, see Installation options.

2.5. Monitoring
Copy link

You can configure monitoring for the single-model serving platform and use Prometheus to scrape metrics for each of the pre-installed model-serving runtimes.

2.6. Model-serving runtimes
Copy link

You can serve models on the single-model serving platform by using model-serving runtimes. The configuration of a model-serving runtime is defined by the ServingRuntime and InferenceService custom resource definitions (CRDs).

2.6.1. ServingRuntime
Copy link

The ServingRuntime CRD creates a serving runtime, an environment for deploying and managing a model. It creates the templates for pods that dynamically load and unload models of various formats and also exposes a service endpoint for inferencing requests.

The following YAML configuration is an example of the vLLM ServingRuntime for KServe model-serving runtime. The configuration includes various flags, environment variables and command-line arguments.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]' 
    openshift.io/display-name: vLLM ServingRuntime for KServe 
  labels:
    opendatahub.io/dashboard: "true"
  name: vllm-runtime
  namespace: <namespace>
spec:
  annotations:
    prometheus.io/path: /metrics 
    prometheus.io/port: "8080" 
  containers:
    - args:
        - --port=8080
        - --model=/mnt/models 
        - --served-model-name={{.Name}} 
      command: 
        - python
        - '-m'
        - vllm.entrypoints.openai.api_server
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      image: quay.io/modh/vllm@sha256:8a3dd8ad6e15fe7b8e5e471037519719d4d8ad3db9d69389f2beded36a6f5b21 
      name: kserve-container
      ports:
        - containerPort: 8080
          protocol: TCP
  multiModel: false 
  supportedModelFormats: 
    - autoSelect: true
      name: vLLM

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'

1


    openshift.io/display-name: vLLM ServingRuntime for KServe

2


  labels:
    opendatahub.io/dashboard: "true"
  name: vllm-runtime
  namespace: <namespace>
spec:
  annotations:
    prometheus.io/path: /metrics

3


    prometheus.io/port: "8080"

4


  containers:
    - args:
        - --port=8080
        - --model=/mnt/models

5


        - --served-model-name={{.Name}}

6


      command:

7


        - python
        - '-m'
        - vllm.entrypoints.openai.api_server
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      image: quay.io/modh/vllm@sha256:8a3dd8ad6e15fe7b8e5e471037519719d4d8ad3db9d69389f2beded36a6f5b21

8


      name: kserve-container
      ports:
        - containerPort: 8080
          protocol: TCP
  multiModel: false

9


  supportedModelFormats:

10


    - autoSelect: true
      name: vLLM

Copy to Clipboard

Toggle word wrap

1: The recommended accelerator to use with the runtime.
2: The name with which the serving runtime is displayed.
3: The endpoint used by Prometheus to scrape metrics for monitoring.
4: The port used by Prometheus to scrape metrics for monitoring.
5: The path to where the model files are stored in the runtime container.
6: Passes the model name that is specified by the {{.Name}} template variable inside the runtime container specification to the runtime environment. The {{.Name}} variable maps to the spec.predictor.name field in the InferenceService metadata object.
7: The entrypoint command that starts the runtime container.
8: The runtime container image used by the serving runtime. This image differs depending on the type of accelerator used.
9: Specifies that the runtime is used for single-model serving.
10: Specifies the model formats supported by the runtime.

2.6.2. InferenceService
Copy link

The InferenceService CRD creates a server or inference service that processes inference queries, passes it to the model, and then returns the inference output.

The inference service also performs the following actions:

Specifies the location and format of the model.
Specifies the serving runtime used to serve the model.
Enables the passthrough route for gRPC or REST inference.
Defines HTTP or gRPC endpoints for the deployed model.

The following example shows the InferenceService YAML configuration file that is generated when deploying a granite model with the vLLM runtime:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: granite
    serving.knative.openshift.io/enablePassthrough: 'true'
    sidecar.istio.io/inject: 'true'
    sidecar.istio.io/rewriteAppHTTPProbers: 'true'
  name: granite
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '6'
          memory: 24Gi
          nvidia.com/gpu: '1'
        requests:
          cpu: '1'
          memory: 8Gi
          nvidia.com/gpu: '1'
      runtime: vllm-runtime
      storage:
        key: aws-connection-my-storage
        path: models/granite-7b-instruct/
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: granite
    serving.knative.openshift.io/enablePassthrough: 'true'
    sidecar.istio.io/inject: 'true'
    sidecar.istio.io/rewriteAppHTTPProbers: 'true'
  name: granite
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '6'
          memory: 24Gi
          nvidia.com/gpu: '1'
        requests:
          cpu: '1'
          memory: 8Gi
          nvidia.com/gpu: '1'
      runtime: vllm-runtime
      storage:
        key: aws-connection-my-storage
        path: models/granite-7b-instruct/
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists

Copy to Clipboard

Toggle word wrap

2.7. Supported model-serving runtimes
Copy link

OpenShift AI includes several preinstalled model-serving runtimes. You can use preinstalled model-serving runtimes to start serving models without modifying or defining the runtime yourself. You can also add a custom runtime to support a model.

See Supported configurations for a list of the supported model-serving runtimes and deployment requirements.

For help adding a custom runtime, see Adding a custom model-serving runtime for the single-model serving platform.

2.8. Tested and verified model-serving runtimes
Copy link

Tested and verified runtimes are community versions of model-serving runtimes that have been tested and verified against specific versions of OpenShift AI.

Red Hat tests the current version of a tested and verified runtime each time there is a new version of OpenShift AI. If a new version of a tested and verified runtime is released in the middle of an OpenShift AI release cycle, it will be tested and verified in an upcoming release.

See Supported configurations for a list of tested and verified runtimes in OpenShift AI.

Note

Tested and verified runtimes are not directly supported by Red Hat. You are responsible for ensuring that you are licensed to use any tested and verified runtimes that you add, and for correctly configuring and maintaining them.

For more information, see Tested and verified runtimes in OpenShift AI.

Additional resources

Inference endpoints

2.9. Inference endpoints
Copy link

These examples show how to use inference endpoints to query the model.

Note

If you enabled token authentication when deploying the model, add the Authorization header and specify a token value.

2.9.1. Caikit TGIS ServingRuntime for KServe
Copy link

:443/api/v1/task/text-generation
:443/api/v1/task/server-streaming-text-generation

Example command

curl --json '{"model_id": "<model_name__>", "inputs": "<text>"}' https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation -H 'Authorization: Bearer <token>'

curl --json '{"model_id": "<model_name__>", "inputs": "<text>"}' https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation -H 'Authorization: Bearer <token>'

Copy to Clipboard

Toggle word wrap

2.9.2. Caikit Standalone ServingRuntime for KServe
Copy link

If you are serving multiple models, you can query /info/models or :443 caikit.runtime.info.InfoService/GetModelsInfo to view a list of served models.

REST endpoints

/api/v1/task/embedding
/api/v1/task/embedding-tasks
/api/v1/task/sentence-similarity
/api/v1/task/sentence-similarity-tasks
/api/v1/task/rerank
/api/v1/task/rerank-tasks
/info/models
/info/version
/info/runtime

gRPC endpoints

:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict
:443 caikit.runtime.Nlp.NlpService/EmbeddingTasksPredict
:443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTaskPredict
:443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTasksPredict
:443 caikit.runtime.Nlp.NlpService/RerankTaskPredict
:443 caikit.runtime.Nlp.NlpService/RerankTasksPredict
:443 caikit.runtime.info.InfoService/GetModelsInfo
:443 caikit.runtime.info.InfoService/GetRuntimeInfo

Note

By default, the Caikit Standalone Runtime exposes REST endpoints. To use gRPC protocol, manually deploy a custom Caikit Standalone ServingRuntime. For more information, see Adding a custom model-serving runtime for the single-model serving platform.

An example manifest is available in the caikit-tgis-serving GitHub repository.

Example command

REST

curl -H 'Content-Type: application/json' -d '{"inputs": "<text>", "model_id": "<model_id>"}' <inference_endpoint_url>/api/v1/task/embedding -H 'Authorization: Bearer <token>'

curl -H 'Content-Type: application/json' -d '{"inputs": "<text>", "model_id": "<model_id>"}' <inference_endpoint_url>/api/v1/task/embedding -H 'Authorization: Bearer <token>'

Copy to Clipboard

Toggle word wrap

gRPC

grpcurl -d '{"text": "<text>"}' -H \"mm-model-id: <model_id>\" <inference_endpoint_url>:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict -H 'Authorization: Bearer <token>'

grpcurl -d '{"text": "<text>"}' -H \"mm-model-id: <model_id>\" <inference_endpoint_url>:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict -H 'Authorization: Bearer <token>'

Copy to Clipboard

Toggle word wrap

2.9.3. TGIS Standalone ServingRuntime for KServe
Copy link

Important

The Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe is deprecated. For more information, see OpenShift AI release notes.

:443 fmaas.GenerationService/Generate
:443 fmaas.GenerationService/GenerateStream
Note
To query the endpoint for the TGIS standalone runtime, you must also download the files in the proto directory of the OpenShift AI text-generation-inference repository.

Example command

grpcurl -proto text-generation-inference/proto/generation.proto -d '{"requests": [{"text":"<text>"}]}' -H 'Authorization: Bearer <token>' -insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate

grpcurl -proto text-generation-inference/proto/generation.proto -d '{"requests": [{"text":"<text>"}]}' -H 'Authorization: Bearer <token>' -insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate

Copy to Clipboard

Toggle word wrap

2.9.4. OpenVINO Model Server
Copy link

/v2/models/<model-name>/infer

Example command

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

Copy to Clipboard

Toggle word wrap

2.9.5. vLLM NVIDIA GPU ServingRuntime for KServe
Copy link

:443/version
:443/docs
:443/v1/models
:443/v1/chat/completions
:443/v1/completions
:443/v1/embeddings
:443/tokenize
:443/detokenize
Note
- The vLLM runtime is compatible with the OpenAI REST API. For a list of models that the vLLM runtime supports, see Supported models.
- To use the embeddings inference endpoint in vLLM, you must use an embeddings model that the vLLM supports. You cannot use the embeddings endpoint with generative models. For more information, see Supported embeddings models in vLLM.
- As of vLLM v0.5.5, you must provide a chat template while querying a model using the /v1/chat/completions endpoint. If your model does not include a predefined chat template, you can use the chat-template command-line parameter to specify a chat template in your custom vLLM runtime, as shown in the example. Replace <CHAT_TEMPLATE> with the path to your template.
  
  containers: - args: - --chat-template=<CHAT_TEMPLATE>
  
  Copy to Clipboard Toggle word wrap
  
  You can use the chat templates that are available as .jinja files here or with the vLLM image under /app/data/template. For more information, see Chat templates.
As indicated by the paths shown, the single-model serving platform uses the HTTPS port of your OpenShift router (usually port 443) to serve external API requests.

Example command

curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -d '{ "messages": [{ "role": "<role>", "content": "<content>" }] -H 'Authorization: Bearer <token>'

curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -d '{ "messages": [{ "role": "<role>", "content": "<content>" }] -H 'Authorization: Bearer <token>'

Copy to Clipboard

Toggle word wrap

2.9.6. vLLM Intel Gaudi Accelerator ServingRuntime for KServe
Copy link

See vLLM NVIDIA GPU ServingRuntime for KServe.

2.9.7. vLLM AMD GPU ServingRuntime for KServe
Copy link

See vLLM NVIDIA GPU ServingRuntime for KServe.

2.9.8. NVIDIA Triton Inference Server
Copy link

REST endpoints

v2/models/[/versions/<model_version>]/infer
v2/models/<model_name>[/versions/<model_version>]
v2/health/ready
v2/health/live
v2/models/<model_name>[/versions/]/ready
v2

Note

ModelMesh does not support the following REST endpoints:

v2/health/live
v2/health/ready
v2/models/<model_name>[/versions/]/ready

Example command

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

Copy to Clipboard

Toggle word wrap

gRPC endpoints

:443 inference.GRPCInferenceService/ModelInfer
:443 inference.GRPCInferenceService/ModelReady
:443 inference.GRPCInferenceService/ModelMetadata
:443 inference.GRPCInferenceService/ServerReady
:443 inference.GRPCInferenceService/ServerLive
:443 inference.GRPCInferenceService/ServerMetadata

Example command

grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata

grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata

Copy to Clipboard

Toggle word wrap

2.9.9. Seldon MLServer
Copy link

REST endpoints

v2/models/[/versions/<model_version>]/infer
v2/models/<model_name>[/versions/<model_version>]
v2/health/ready
v2/health/live
v2/models/<model_name>[/versions/]/ready
v2

Example command

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

Copy to Clipboard

Toggle word wrap

gRPC endpoints

:443 inference.GRPCInferenceService/ModelInfer
:443 inference.GRPCInferenceService/ModelReady
:443 inference.GRPCInferenceService/ModelMetadata
:443 inference.GRPCInferenceService/ServerReady
:443 inference.GRPCInferenceService/ServerLive
:443 inference.GRPCInferenceService/ServerMetadata

Example command

grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata

grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata

Copy to Clipboard

Toggle word wrap

2.10. About KServe deployment modes
Copy link

You can deploy models in either advanced or standard deployment mode.

Advanced deployment mode uses Knative Serverless. By default, KServe integrates with Red Hat OpenShift Serverless and Red Hat OpenShift Service Mesh to deploy models on the single-model serving platform. Red Hat Serverless is based on the open source Knative project and requires the Red Hat OpenShift Serverless Operator.

Alternatively, you can use standard deployment mode, which uses KServe RawDeployment mode and does not require the Red Hat OpenShift Serverless Operator, Red Hat OpenShift Service Mesh, or Authorino.

If you configure KServe for advanced deployment mode, you can set up your data science project to serve models in both advanced and standard deployment mode. However, if you configure KServe for only standard deployment mode, you can only use standard deployment mode.

There are both advantages and disadvantages to using each of these deployment modes:

2.10.1. Advanced mode
Copy link

Advantages:

Enables autoscaling based on request volume:
- Resources scale up automatically when receiving incoming requests.
- Optimizes resource usage and maintains performance during peak times.
Supports scale down to and from zero using Knative:
- Allows resources to scale down completely when there are no incoming requests.
- Saves costs by not running idle resources.

Disadvantages:

Has customization limitations:
- Serverless is backed by Knative and implicitly inherits the same design choices, such as when mounting multiple volumes.
Dependency on Knative for scaling:
- Introduces additional complexity in setup and management compared to traditional scaling methods.
Cluster scoped component:
- If the cluster already has Serverless configured, you must manually configure the cluster to make it work with OpenShift AI.

2.10.2. Standard mode
Copy link

Advantages:

Enables deployment with Kubernetes resources, such as Deployment, Service, Route, and Horizontal Pod Autoscaler, without additional dependencies like Red Hat Serverless, Red Hat Service Mesh, and Authorino.
- The resulting model deployment has a smaller resource footprint compared to advanced mode.
Enables traditional Deployment/Pod configurations, such as mounting multiple volumes, which is not available using Knative.
- Beneficial for applications requiring complex configurations or multiple storage mounts.

Disadvantages:

Does not support automatic scaling:
- Does not support automatic scaling down to zero resources when idle.
- Might result in higher costs during periods of low traffic.

2.11. Deploying models by using the single-model serving platform
Copy link

On the single-model serving platform, each model is deployed on its own model server. This helps you to deploy, monitor, scale, and maintain large models that require increased resources.

Important

If you want to use the single-model serving platform to deploy a model from S3-compatible storage that uses a self-signed SSL certificate, you must install a certificate authority (CA) bundle on your OpenShift cluster. For more information, see Working with certificates.

2.11.1. Enabling the single-model serving platform
Copy link

When you have installed KServe, you can use the Red Hat OpenShift AI dashboard to enable the single-model serving platform. You can also use the dashboard to enable model-serving runtimes for the platform.

Prerequisites

You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
You have installed KServe.
The spec.dashboardConfig.disableKServe dashboard configuration option is set to false (the default).
For more information about setting dashboard configuration options, see Customizing the dashboard.

Procedure

Enable the single-model serving platform as follows:
1. In the left menu, click Settings Cluster settings.
2. Locate the Model serving platforms section.
3. To enable the single-model serving platform for projects, select the Single-model serving platform checkbox.
4. Select Standard (No additional dependencies) or Advanced (Serverless and Service Mesh) deployment mode.
  For more information about these deployment mode options, see About KServe deployment modes.
5. Click Save changes.
Enable preinstalled runtimes for the single-model serving platform as follows:
1. In the left menu of the OpenShift AI dashboard, click Settings Serving runtimes.
  The Serving runtimes page shows preinstalled runtimes and any custom runtimes that you have added.
  For more information about preinstalled runtimes, see Supported runtimes.
2. Set the runtime that you want to use to Enabled.
  The single-model serving platform is now available for model deployments.

2.11.2. Adding a custom model-serving runtime for the single-model serving platform
Copy link

A model-serving runtime adds support for a specified set of model frameworks and the model formats supported by those frameworks. You can use the preinstalled runtimes that are included with OpenShift AI. You can also add your own custom runtimes if the default runtimes do not meet your needs.

As an administrator, you can use the OpenShift AI interface to add and enable a custom model-serving runtime. You can then choose the custom runtime when you deploy a model on the single-model serving platform.

Note

Red Hat does not provide support for custom runtimes. You are responsible for ensuring that you are licensed to use any custom runtimes that you add, and for correctly configuring and maintaining them.

Prerequisites

You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
You have built your custom runtime and added the image to a container image repository such as Quay.

Procedure

From the OpenShift AI dashboard, click Settings Serving runtimes.
The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
To add a custom runtime, choose one of the following options:
- To start with an existing runtime (for example, vLLM NVIDIA GPU ServingRuntime for KServe), click the action menu (⋮) next to the existing runtime and then click Duplicate.
- To add a new custom runtime, click Add serving runtime.
In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
In the Select the API protocol this runtime supports list, select REST or gRPC.
Optional: If you started a new runtime (rather than duplicating an existing one), add your code by choosing one of the following options:
- Upload a YAML file
  1. Click Upload files.
  2. In the file browser, select a YAML file on your computer.
    The embedded YAML editor opens and shows the contents of the file that you uploaded.
- Enter YAML code directly in the editor
  1. Click Start from scratch.
  2. Enter or paste YAML code directly in the embedded editor.
Note
In many cases, creating a custom runtime will require adding new or custom parameters to the env section of the ServingRuntime specification.
Click Add.
The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the custom runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.
Optional: To edit your custom runtime, click the action menu (⋮) and select Edit.

Verification

The custom model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.

2.11.3. Adding a tested and verified model-serving runtime for the single-model serving platform
Copy link

In addition to preinstalled and custom model-serving runtimes, you can also use Red Hat tested and verified model-serving runtimes such as the NVIDIA Triton Inference Server to support your needs. For more information about Red Hat tested and verified runtimes, see Tested and verified runtimes for Red Hat OpenShift AI.

You can use the Red Hat OpenShift AI dashboard to add and enable the NVIDIA Triton Inference Server or the Seldon MLServer runtime for the single-model serving platform. You can then choose the runtime when you deploy a model on the single-model serving platform.

Prerequisites

You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.

Procedure

From the OpenShift AI dashboard, click Settings Serving runtimes.
The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
Click Add serving runtime.
In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
In the Select the API protocol this runtime supports list, select REST or gRPC.
Click Start from scratch.

Follow these steps to add the NVIDIA Triton Inference Server runtime:

If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-kserve-rest
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
    - args:
        - tritonserver
        - --model-store=/mnt/models
        - --grpc-port=9000
        - --http-port=8080
        - --allow-grpc=true
        - --allow-http=true
      image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
      name: kserve-container
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "1"
          memory: 2Gi
      ports:
        - containerPort: 8080
          protocol: TCP
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: tensorrt
      version: "8"
    - autoSelect: true
      name: tensorflow
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "2"
    - autoSelect: true
      name: onnx
      version: "1"
    - name: pytorch
      version: "1"
    - autoSelect: true
      name: triton
      version: "2"
    - autoSelect: true
      name: xgboost
      version: "1"
    - autoSelect: true
      name: python
      version: "1"

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-kserve-rest
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
    - args:
        - tritonserver
        - --model-store=/mnt/models
        - --grpc-port=9000
        - --http-port=8080
        - --allow-grpc=true
        - --allow-http=true
      image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
      name: kserve-container
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "1"
          memory: 2Gi
      ports:
        - containerPort: 8080
          protocol: TCP
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: tensorrt
      version: "8"
    - autoSelect: true
      name: tensorflow
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "2"
    - autoSelect: true
      name: onnx
      version: "1"
    - name: pytorch
      version: "1"
    - autoSelect: true
      name: triton
      version: "2"
    - autoSelect: true
      name: xgboost
      version: "1"
    - autoSelect: true
      name: python
      version: "1"

Copy to Clipboard

Toggle word wrap

If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-kserve-grpc
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
    - args:
        - tritonserver
        - --model-store=/mnt/models
        - --grpc-port=9000
        - --http-port=8080
        - --allow-grpc=true
        - --allow-http=true
      image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
      name: kserve-container
      ports:
        - containerPort: 9000
          name: h2c
          protocol: TCP
      volumeMounts:
        - mountPath: /dev/shm
          name: shm
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "1"
          memory: 2Gi
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: tensorrt
      version: "8"
    - autoSelect: true
      name: tensorflow
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "2"
    - autoSelect: true
      name: onnx
      version: "1"
    - name: pytorch
      version: "1"
    - autoSelect: true
      name: triton
      version: "2"
    - autoSelect: true
      name: xgboost
      version: "1"
    - autoSelect: true
      name: python
      version: "1"
volumes:
  - emptyDir: null
    medium: Memory
    sizeLimit: 2Gi
    name: shm

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-kserve-grpc
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
    - args:
        - tritonserver
        - --model-store=/mnt/models
        - --grpc-port=9000
        - --http-port=8080
        - --allow-grpc=true
        - --allow-http=true
      image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
      name: kserve-container
      ports:
        - containerPort: 9000
          name: h2c
          protocol: TCP
      volumeMounts:
        - mountPath: /dev/shm
          name: shm
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "1"
          memory: 2Gi
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: tensorrt
      version: "8"
    - autoSelect: true
      name: tensorflow
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "2"
    - autoSelect: true
      name: onnx
      version: "1"
    - name: pytorch
      version: "1"
    - autoSelect: true
      name: triton
      version: "2"
    - autoSelect: true
      name: xgboost
      version: "1"
    - autoSelect: true
      name: python
      version: "1"
volumes:
  - emptyDir: null
    medium: Memory
    sizeLimit: 2Gi
    name: shm

Copy to Clipboard

Toggle word wrap

Follow these steps to add the Seldon MLServer runtime:

If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: mlserver-kserve-rest
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    openshift.io/display-name: Seldon MLServer
    prometheus.kserve.io/port: "8080"
    prometheus.kserve.io/path: /metrics
  containers:
    - name: kserve-container
      image: 'docker.io/seldonio/mlserver@sha256:07890828601515d48c0fb73842aaf197cbcf245a5c855c789e890282b15ce390'
      env:
        - name: MLSERVER_HTTP_PORT
          value: "8080"
        - name: MLSERVER_GRPC_PORT
          value: "9000"
        - name: MODELS_DIR
          value: /mnt/models
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "1"
          memory: 2Gi
      ports:
        - containerPort: 8080
          protocol: TCP
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        privileged: false
        runAsNonRoot: true
  protocolVersions:
    - v2
  multiModel: false
  supportedModelFormats:
    - name: sklearn
      version: "0"
      autoSelect: true
      priority: 2
    - name: sklearn
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "2"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "3"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "4"
      autoSelect: true
      priority: 2
    - name: mlflow
      version: "1"
      autoSelect: true
      priority: 1
    - name: mlflow
      version: "2"
      autoSelect: true
      priority: 1
    - name: catboost
      version: "1"
      autoSelect: true
      priority: 1
    - name: huggingface
      version: "1"
      autoSelect: true
      priority: 1

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: mlserver-kserve-rest
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    openshift.io/display-name: Seldon MLServer
    prometheus.kserve.io/port: "8080"
    prometheus.kserve.io/path: /metrics
  containers:
    - name: kserve-container
      image: 'docker.io/seldonio/mlserver@sha256:07890828601515d48c0fb73842aaf197cbcf245a5c855c789e890282b15ce390'
      env:
        - name: MLSERVER_HTTP_PORT
          value: "8080"
        - name: MLSERVER_GRPC_PORT
          value: "9000"
        - name: MODELS_DIR
          value: /mnt/models
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "1"
          memory: 2Gi
      ports:
        - containerPort: 8080
          protocol: TCP
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        privileged: false
        runAsNonRoot: true
  protocolVersions:
    - v2
  multiModel: false
  supportedModelFormats:
    - name: sklearn
      version: "0"
      autoSelect: true
      priority: 2
    - name: sklearn
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "2"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "3"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "4"
      autoSelect: true
      priority: 2
    - name: mlflow
      version: "1"
      autoSelect: true
      priority: 1
    - name: mlflow
      version: "2"
      autoSelect: true
      priority: 1
    - name: catboost
      version: "1"
      autoSelect: true
      priority: 1
    - name: huggingface
      version: "1"
      autoSelect: true
      priority: 1

Copy to Clipboard

Toggle word wrap

If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: mlserver-kserve-grpc
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    openshift.io/display-name: Seldon MLServer
    prometheus.kserve.io/port: "8080"
    prometheus.kserve.io/path: /metrics
  containers:
    - name: kserve-container
      image: 'docker.io/seldonio/mlserver@sha256:07890828601515d48c0fb73842aaf197cbcf245a5c855c789e890282b15ce390'
      env:
        - name: MLSERVER_HTTP_PORT
          value: "8080"
        - name: MLSERVER_GRPC_PORT
          value: "9000"
        - name: MODELS_DIR
          value: /mnt/models
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "1"
          memory: 2Gi
      ports:
        - containerPort: 9000
          name: h2c
          protocol: TCP
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        privileged: false
        runAsNonRoot: true
  protocolVersions:
    - v2
  multiModel: false
  supportedModelFormats:
    - name: sklearn
      version: "0"
      autoSelect: true
      priority: 2
    - name: sklearn
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "2"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "3"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "4"
      autoSelect: true
      priority: 2
    - name: mlflow
      version: "1"
      autoSelect: true
      priority: 1
    - name: mlflow
      version: "2"
      autoSelect: true
      priority: 1
    - name: catboost
      version: "1"
      autoSelect: true
      priority: 1
    - name: huggingface
      version: "1"
      autoSelect: true
      priority: 1

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: mlserver-kserve-grpc
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    openshift.io/display-name: Seldon MLServer
    prometheus.kserve.io/port: "8080"
    prometheus.kserve.io/path: /metrics
  containers:
    - name: kserve-container
      image: 'docker.io/seldonio/mlserver@sha256:07890828601515d48c0fb73842aaf197cbcf245a5c855c789e890282b15ce390'
      env:
        - name: MLSERVER_HTTP_PORT
          value: "8080"
        - name: MLSERVER_GRPC_PORT
          value: "9000"
        - name: MODELS_DIR
          value: /mnt/models
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "1"
          memory: 2Gi
      ports:
        - containerPort: 9000
          name: h2c
          protocol: TCP
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        privileged: false
        runAsNonRoot: true
  protocolVersions:
    - v2
  multiModel: false
  supportedModelFormats:
    - name: sklearn
      version: "0"
      autoSelect: true
      priority: 2
    - name: sklearn
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "2"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "3"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "4"
      autoSelect: true
      priority: 2
    - name: mlflow
      version: "1"
      autoSelect: true
      priority: 1
    - name: mlflow
      version: "2"
      autoSelect: true
      priority: 1
    - name: catboost
      version: "1"
      autoSelect: true
      priority: 1
    - name: huggingface
      version: "1"
      autoSelect: true
      priority: 1

Copy to Clipboard

Toggle word wrap

In the metadata.name field, make sure that the value of the runtime you are adding does not match a runtime that you have already added.
Optional: To use a custom display name for the runtime that you are adding, add a metadata.annotations.openshift.io/display-name field and specify a value, as shown in the following example:
```
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: kserve-triton
  annotations:
    openshift.io/display-name: Triton ServingRuntime
```
```
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: kserve-triton
  annotations:
    openshift.io/display-name: Triton ServingRuntime
```
Copy to Clipboard Toggle word wrap
Note
If you do not configure a custom display name for your runtime, OpenShift AI shows the value of the metadata.name field.
Click Create.
The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.
Optional: To edit the runtime, click the action menu (⋮) and select Edit.

Verification

The model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.

2.11.4. Deploying models on the single-model serving platform
Copy link

When you have enabled the single-model serving platform, you can enable a preinstalled or custom model-serving runtime and deploy models on the platform.

You can use preinstalled model-serving runtimes to start serving models without modifying or defining the runtime yourself. For help adding a custom runtime, see Adding a custom model-serving runtime for the single-model serving platform.

Prerequisites

You have logged in to Red Hat OpenShift AI.
You have installed KServe.
You have enabled the single-model serving platform.
(Advanced deployments only) To enable token authentication and external model routes for deployed models, you have added Authorino as an authorization provider. For more information, see Adding an authorization provider for the single-model serving platform.
You have created a data science project.
You have access to S3-compatible object storage.
For the model that you want to deploy, you know the associated URI in your S3-compatible object storage bucket or Open Container Initiative (OCI) container.
To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see Converting Hugging Face Hub models to Caikit format in the caikit-tgis-serving repository.
If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in OpenShift AI. If you use NVIDIA GPUs, see Enabling NVIDIA GPUs. If you use AMD GPUs, see AMD GPU integration.
To use the vLLM runtime, you have enabled GPU support in OpenShift AI and have installed and configured the Node Feature Discovery operator on your cluster. For more information, see Installing the Node Feature Discovery operator and Enabling NVIDIA GPUs
To use the vLLM Intel Gaudi Accelerator ServingRuntime for KServe runtime, you have enabled support for hybrid processing units (HPUs) in OpenShift AI. This includes installing the Intel Gaudi AI accelerator operator and configuring an hardware profile. For more information, see Setting up Gaudi for OpenShift in the AMD documentation and Working with hardware profiles.
To use the vLLM AMD GPU ServingRuntime for KServe runtime, you have enabled support for AMD graphic processing units (GPUs) in OpenShift AI. This includes installing the AMD GPU operator and configuring a hardware profile. For more information, see Deploying the AMD GPU operator on OpenShift and Working with hardware profiles.
Note
In OpenShift AI, Red Hat supports NVIDIA GPU, Intel Gaudi, and AMD GPU accelerators for model serving.
To deploy RHEL AI models:
- You have enabled the vLLM NVIDIA GPU ServingRuntime for KServe runtime.
- You have downloaded the model from the Red Hat container registry and uploaded it to S3-compatible object storage.

Procedure

In the left menu, click Data science projects.
The Data science projects page opens.
Click the name of the project that you want to deploy a model in.
A project details page opens.
Click the Models tab.
Perform one of the following actions:
- If you see a Single-model serving platform tile, click Deploy model on the tile.
- If you do not see any tiles, click the Deploy model button.
The Deploy model dialog opens.
In the Model deployment name field, enter a unique name for the model that you are deploying.
In the Serving runtime field, select an enabled runtime. If project-scoped runtimes exist, the Serving runtime list includes subheadings to distinguish between global runtimes and project-scoped runtimes.
From the Model framework (name - version) list, select a value.
From the Deployment mode list, select standard or advanced. For more information about deployment modes, see About KServe deployment modes.
In the Number of model server replicas to deploy field, specify a value.
The following options are only available if you have created a hardware profile:
1. From the Hardware profile list, select a hardware profile. If project-scoped hardware profiles exist, the Hardware profile list includes subheadings to distinguish between global hardware profiles and project-scoped hardware profiles.
  Important
  By default, hardware profiles are hidden in the dashboard navigation menu and user interface, while accelerator profiles remain visible. In addition, user interface components associated with the deprecated accelerator profiles functionality are still displayed. If you enable hardware profiles, the Hardware profiles list appears instead of the Accelerator profiles list. To show the Settings Hardware profiles option in the dashboard navigation menu, and the user interface components associated with hardware profiles, set the disableHardwareProfiles value to false in the OdhDashboardConfig custom resource (CR) in OpenShift. For more information about setting dashboard configuration options, see Customizing the dashboard.
2. Optional To change these default values, click Customize resource requests and limit and enter new minimum (request) and maximum (limit) values. The hardware profile specifies the number of CPUs and the amount of memory allocated to the container, setting the guaranteed minimum (request) and maximum (limit) for both.
Optional: In the Model route section, select the Make deployed models available through an external route checkbox to make your deployed models available to external clients.
To require token authentication for inference requests to the deployed model, perform the following actions:
1. Select Require token authentication.
2. In the Service account name field, enter the service account name that the token will be generated for.
3. To add an additional service account, click Add a service account and enter another service account name.
To specify the location of your model, perform one of the following sets of actions:
- To use an existing connection
  1. Select Existing connection.
  2. From the Name list, select a connection that you previously defined.
    For S3-compatible object storage: In the Path field, enter the folder path that contains the model in your specified data source.
    Important
    The OpenVINO Model Server runtime has specific requirements for how you specify the model path. For more information, see known issue RHOAIENG-3025 in the OpenShift AI release notes.
    For Open Container Image connections: In the OCI storage location field, enter the model URI where the model is located.
    Note
    If you are deploying a registered model version with an existing S3, URI, or OCI data connection, some of your connection details might be autofilled. This depends on the type of data connection and the number of matching connections available in your data science project. For example, if only one matching connection exists, fields like the path, URI, endpoint, model URI, bucket, and region might populate automatically. Matching connections will be labeled as Recommended.
- To use a new connection
  1. To define a new connection that your model can access, select New connection.
    In the Add connection modal, select a Connection type. The OCI-compliant registry, S3 compatible object storage, and URI options are pre-installed connection types. Additional options might be available if your OpenShift AI administrator added them.
    The Add connection form opens with fields specific to the connection type that you selected.
  2. Fill in the connection detail fields.
    Important
    If your connection type is an S3-compatible object storage, you must provide the folder path that contains your data file. The OpenVINO Model Server runtime has specific requirements for how you specify the model path. For more information, see known issue RHOAIENG-3025 in the OpenShift AI release notes.
(Optional) Customize the runtime parameters in the Configuration parameters section:
1. Modify the values in Additional serving runtime arguments to define how the deployed model behaves.
2. Modify the values in Additional environment variables to define variables in the model’s environment.
  The Configuration parameters section shows predefined serving runtime parameters, if any are available.
  Note
  Do not modify the port or model serving runtime arguments, because they require specific values to be set. Overwriting these parameters can cause the deployment to fail.
Click Deploy.

Verification

Confirm that the deployed model is shown on the Models tab for the project, and on the Model deployments page of the dashboard with a checkmark in the Status column.

2.11.5. Stopping and starting a deployed model
Copy link

You can stop a deployed model to perform edits without consuming cluster resources or triggering a redeployment. When you stop a model, all associated objects are terminated, and the model is unavailable for inference requests. When you start the model again, any pending configuration changes are applied.

Prerequisites

You have logged in to Red Hat OpenShift AI.
You have deployed a model in a data science project.

Procedure

From the OpenShift AI dashboard, click Models > Model deployments.
Locate the model that you want to stop or start.
In the Status column for the model, click Stop or Start.
When you stop the model, the status changes to Stopping as the pods are terminated, and then changes to Stopped. When you start the model, the status changes to Starting as new pods are created, and then changes to Running.

2.11.6. Deploying models by using multiple GPU nodes
Copy link

Deploy models across multiple GPU nodes to handle large models, such as large language models (LLMs).

You can serve models on Red Hat OpenShift AI across multiple GPU nodes using the vLLM serving framework. Multi-node inferencing uses the vllm-multinode-runtime custom runtime, which uses the same image as the vLLM NVIDIA GPU ServingRuntime for KServe runtime and also includes information necessary for multi-GPU inferencing.

You can deploy the model from a persistent volume claim (PVC) or from an Open Container Initiative (OCI) container image.

Important

Deploying models by using multiple GPU nodes is currently available in Red Hat OpenShift AI as a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope

Prerequisites

You have cluster administrator privileges for your OpenShift cluster.
You have downloaded and installed the OpenShift command-line interface (CLI). For more information, see Installing the OpenShift CLI (Red Hat OpenShift Dedicated) or Installing the OpenShift CLI (Red Hat OpenShift Service on AWS).
You have enabled the operators for your GPU type, such as Node Feature Discovery Operator, NVIDIA GPU Operator. For more information about enabling accelerators, see Enabling accelerators.
- You are using an NVIDIA GPU (nvidia.com/gpu).
- You have specified the GPU type through either the ServingRuntime or InferenceService. If the GPU type specified in the ServingRuntime differs from what is set in the InferenceService, both GPU types are assigned to the resource and can cause errors.
You have enabled KServe on your cluster.
You have only one head pod in your setup. Do not adjust the replica count using the min_replicas or max_replicas settings in the InferenceService. Creating additional head pods can cause them to be excluded from the Ray cluster.
To deploy from a PVC: You have a persistent volume claim (PVC) set up and configured for ReadWriteMany (RWX) access mode.
To deploy from an OCI container image:
- You have stored a model in an OCI container image.
- If the model is stored in a private OCI repository, you have configured an image pull secret.

Procedure

In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:
```
oc login <openshift_cluster_url> -u <admin_username> -p <password>
```
```
$ oc login <openshift_cluster_url> -u <admin_username> -p <password>
```
Copy to Clipboard Toggle word wrap
Select or create a namespace for deploying the model. For example, run the following command to create the kserve-demo namespace:
```
oc new-project kserve-demo
```
```
oc new-project kserve-demo
```
Copy to Clipboard Toggle word wrap

(Deploying a model from a PVC only) Create a PVC for model storage in the namespace where you want to deploy the model. Create a storage class using Filesystem volumeMode and use this storage class for your PVC. The storage size must be larger than the size of the model files on disk. For example:

Note

If you have already configured a PVC or are deploying a model from an OCI container image, you can skip this step.

kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: granite-8b-code-base-pvc
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: <model size>
  storageClassName: <storage class>

kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: granite-8b-code-base-pvc
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: <model size>
  storageClassName: <storage class>

Copy to Clipboard

Toggle word wrap

Create a pod to download the model to the PVC you created. Update the sample YAML with your bucket name, model path, and credentials:

apiVersion: v1
kind: Pod
metadata:
  name: download-granite-8b-code
  labels:
    name: download-granite-8b-code
spec:
  volumes:
    - name: model-volume
      persistentVolumeClaim:
        claimName: granite-8b-code-base-pvc
  restartPolicy: Never
  initContainers:
    - name: fix-volume-permissions
      image: quay.io/quay/busybox@sha256:92f3298bf80a1ba949140d77987f5de081f010337880cd771f7e7fc928f8c74d
      command: ["sh"]
      args: ["-c", "mkdir -p /mnt/models/$(MODEL_PATH) && chmod -R 777 /mnt/models"] 
      volumeMounts:
        - mountPath: "/mnt/models/"
          name: model-volume
      env:
        - name: MODEL_PATH
          value: <model path> 
  containers:
    - resources:
        requests:
          memory: 40Gi
      name: download-model
      imagePullPolicy: IfNotPresent
      image: quay.io/opendatahub/kserve-storage-initializer:v0.14 
      args:
        - 's3://$(BUCKET_NAME)/$(MODEL_PATH)/'
        - /mnt/models/$(MODEL_PATH)
      env:
        - name: AWS_ACCESS_KEY_ID
          value: <id> 
        - name: AWS_SECRET_ACCESS_KEY
          value: <secret> 
        - name: BUCKET_NAME
          value: <bucket_name> 
        - name: MODEL_PATH
          value: <model path> 
        - name: S3_USE_HTTPS
          value: "1"
        - name: AWS_ENDPOINT_URL
          value: <AWS endpoint> 
        - name: awsAnonymousCredential
          value: 'false'
        - name: AWS_DEFAULT_REGION
          value: <region> 
        - name: S3_VERIFY_SSL
          value: 'true' 
      volumeMounts:
        - mountPath: "/mnt/models/"
          name: model-volume

apiVersion: v1
kind: Pod
metadata:
  name: download-granite-8b-code
  labels:
    name: download-granite-8b-code
spec:
  volumes:
    - name: model-volume
      persistentVolumeClaim:
        claimName: granite-8b-code-base-pvc
  restartPolicy: Never
  initContainers:
    - name: fix-volume-permissions
      image: quay.io/quay/busybox@sha256:92f3298bf80a1ba949140d77987f5de081f010337880cd771f7e7fc928f8c74d
      command: ["sh"]
      args: ["-c", "mkdir -p /mnt/models/$(MODEL_PATH) && chmod -R 777 /mnt/models"]

1


      volumeMounts:
        - mountPath: "/mnt/models/"
          name: model-volume
      env:
        - name: MODEL_PATH
          value: <model path>

2


  containers:
    - resources:
        requests:
          memory: 40Gi
      name: download-model
      imagePullPolicy: IfNotPresent
      image: quay.io/opendatahub/kserve-storage-initializer:v0.14

3


      args:
        - 's3://$(BUCKET_NAME)/$(MODEL_PATH)/'
        - /mnt/models/$(MODEL_PATH)
      env:
        - name: AWS_ACCESS_KEY_ID
          value: <id>

4


        - name: AWS_SECRET_ACCESS_KEY
          value: <secret>

5


        - name: BUCKET_NAME
          value: <bucket_name>

6


        - name: MODEL_PATH
          value: <model path>

7


        - name: S3_USE_HTTPS
          value: "1"
        - name: AWS_ENDPOINT_URL
          value: <AWS endpoint>

8


        - name: awsAnonymousCredential
          value: 'false'
        - name: AWS_DEFAULT_REGION
          value: <region>

9


        - name: S3_VERIFY_SSL
          value: 'true'

10


      volumeMounts:
        - mountPath: "/mnt/models/"
          name: model-volume

Copy to Clipboard

Toggle word wrap

1: The chmod operation is permitted only if your pod is running as root. Remove`chmod -R 777` from the arguments if you are not running the pod as root.
2 7: Specify the path to the model.
3: The value for containers.image, located in your InferenceService. To access this value, run the following command: oc get configmap inferenceservice-config -n redhat-ods-operator -oyaml | grep kserve-storage-initializer:
4: The access key ID to your S3 bucket.
5: The secret access key to your S3 bucket.
6: The name of your S3 bucket.
8: The endpoint to your S3 bucket.
9: The region for your S3 bucket if using an AWS S3 bucket. If using other S3-compatible storage, such as ODF or Minio, you can remove the AWS_DEFAULT_REGION environment variable.
10: If you encounter SSL errors, change S3_VERIFY_SSL to false.

Create the vllm-multinode-runtime custom runtime in your project namespace:

oc process vllm-multinode-runtime-template -n redhat-ods-applications|oc apply -n kserve-demo -f -

oc process vllm-multinode-runtime-template -n redhat-ods-applications|oc apply -n kserve-demo -f -

Copy to Clipboard

Toggle word wrap

Deploy the model using the following InferenceService configuration:
```
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/autoscalerClass: external
  name: <inference service name>
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-multinode-runtime
      storageUri: <storage_uri_path> 
    workerSpec: {} 
```
```
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/autoscalerClass: external
  name: <inference service name>
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-multinode-runtime
      storageUri: <storage_uri_path> 
```
1
```
    workerSpec: {} 
```
2
Copy to Clipboard Toggle word wrap
1
Specify the path to your model based on your deployment method:
For PVC: pvc://<pvc_name>/<model_path>
For an OCI container image: oci://<registry_host>/<org_or_username>/<repository_name><tag_or_digest>
2
The following configuration can be added to the InferenceService:
workerSpec.tensorParallelSize: Determines how many GPUs are used per node. The GPU type count in both the head and worker node deployment resources is updated automatically. Ensure that the value of workerSpec.tensorParallelSize is at least 1.
workerSpec.pipelineParallelSize: Determines how many nodes are used to balance the model in deployment. This variable represents the total number of nodes, including both the head and worker nodes. Ensure that the value of workerSpec.pipelineParallelSize is at least 2. Do not modify this value in production environments.
Note
You may need to specify additional arguments, depending on your environment and model size.
Deploy the model by applying the InferenceService configuration:
```
oc apply -f <inference-service-file.yaml>
```
```
oc apply -f <inference-service-file.yaml>
```
Copy to Clipboard Toggle word wrap

Verification

To confirm that you have set up your environment to deploy models on multiple GPU nodes, check the GPU resource status, the InferenceService status, the Ray cluster status, and send a request to the model.

Check the GPU resource status:

Retrieve the pod names for the head and worker nodes:

Get pod name
Check the GPU memory size for both the head and worker pods:

# Get pod name
podName=$(oc get pod -l app=isvc.granite-8b-code-base-pvc-predictor --no-headers|cut -d' ' -f1)
workerPodName=$(oc get pod -l app=isvc.granite-8b-code-base-pvc-predictor-worker --no-headers|cut -d' ' -f1)

oc wait --for=condition=ready pod/${podName} --timeout=300s
# Check the GPU memory size for both the head and worker pods:
echo "### HEAD NODE GPU Memory Size"
kubectl exec $podName -- nvidia-smi
echo "### Worker NODE GPU Memory Size"
kubectl exec $workerPodName -- nvidia-smi

Copy to Clipboard

Toggle word wrap

Sample response

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   33C    P0             71W /  300W |19031MiB /  23028MiB <1>|      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
         ...
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   30C    P0             69W /  300W |18959MiB /  23028MiB <2>|      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   33C    P0             71W /  300W |19031MiB /  23028MiB <1>|      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
         ...
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   30C    P0             69W /  300W |18959MiB /  23028MiB <2>|      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Copy to Clipboard

Toggle word wrap

Confirm that the model loaded properly by checking the values of <1> and <2>. If the model did not load, the value of these fields is 0MiB.

Verify the status of your InferenceService using the following command: NOTE: In the Technology Preview, you can only use port forwarding for inferencing.

oc wait --for=condition=ready pod/${podName} -n $DEMO_NAMESPACE --timeout=300s
export MODEL_NAME=granite-8b-code-base-pvc

oc wait --for=condition=ready pod/${podName} -n $DEMO_NAMESPACE --timeout=300s
export MODEL_NAME=granite-8b-code-base-pvc

Copy to Clipboard

Toggle word wrap

Sample response

   NAME                 URL                                                   READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                          AGE
   granite-8b-code-base-pvc   http://granite-8b-code-base-pvc.default.example.com

   NAME                 URL                                                   READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                          AGE
   granite-8b-code-base-pvc   http://granite-8b-code-base-pvc.default.example.com

Copy to Clipboard

Toggle word wrap

Send a request to the model to confirm that the model is available for inference:

oc wait --for=condition=ready pod/${podName} -n vllm-multinode --timeout=300s

oc port-forward $podName 8080:8080 &

curl http://localhost:8080/v1/completions \
       -H "Content-Type: application/json" \
       -d "{
            'model': "$MODEL_NAME",
            'prompt': 'At what temperature does Nitrogen boil?',
            'max_tokens': 100,
            'temperature': 0
        }"

oc wait --for=condition=ready pod/${podName} -n vllm-multinode --timeout=300s

oc port-forward $podName 8080:8080 &

curl http://localhost:8080/v1/completions \
       -H "Content-Type: application/json" \
       -d "{
            'model': "$MODEL_NAME",
            'prompt': 'At what temperature does Nitrogen boil?',
            'max_tokens': 100,
            'temperature': 0
        }"

Copy to Clipboard

Toggle word wrap

2.11.7. Setting a timeout for KServe
Copy link

When deploying large models or using node autoscaling with KServe, the operation may time out before a model is deployed because the default progress-deadline that KNative Serving sets is 10 minutes.

If a pod using KNative Serving takes longer than 10 minutes to deploy, the pod might be automatically marked as failed. This can happen if you are deploying large models that take longer than 10 minutes to pull from S3-compatible object storage or if you are using node autoscaling to reduce the consumption of GPU nodes.

To resolve this issue, you can set a custom progress-deadline in the KServe InferenceService for your application.

Prerequisites

You have namespace edit access for your OpenShift cluster.

Procedure

Log in to the OpenShift console as a cluster administrator.
Select the project where you have deployed the model.
In the Administrator perspective, click Home Search.
From the Resources dropdown menu, search for InferenceService.

Under spec.predictor.annotations, modify the serving.knative.dev/progress-deadline with the new timeout:

apiVersion: serving.kserve.io/v1alpha1
kind: InferenceService
metadata:
  name: my-inference-service
spec:
  predictor:
    annotations:
      serving.knative.dev/progress-deadline: 30m

apiVersion: serving.kserve.io/v1alpha1
kind: InferenceService
metadata:
  name: my-inference-service
spec:
  predictor:
    annotations:
      serving.knative.dev/progress-deadline: 30m

Copy to Clipboard

Toggle word wrap

Note

Ensure that you set the progress-deadline on the spec.predictor.annotations level, so that the KServe InferenceService can copy the progress-deadline back to the KNative Service object.

2.11.8. Customizing the parameters of a deployed model-serving runtime
Copy link

You might need additional parameters beyond the default ones to deploy specific models or to enhance an existing model deployment. In such cases, you can modify the parameters of an existing runtime to suit your deployment needs.

Note

Customizing the parameters of a runtime only affects the selected model deployment.

Prerequisites

You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
You have deployed a model on the single-model serving platform.

Procedure

From the OpenShift AI dashboard, click Models Model deployments.
The Model deployments page opens.
Click Stop next to the name of the model you want to customize.
Click the action menu (⋮) and select Edit.
The Configuration parameters section shows predefined serving runtime parameters, if any are available.
Customize the runtime parameters in the Configuration parameters section:
1. Modify the values in Additional serving runtime arguments to define how the deployed model behaves.
2. Modify the values in Additional environment variables to define variables in the model’s environment.
  Note
  Do not modify the port or model serving runtime arguments, because they require specific values to be set. Overwriting these parameters can cause the deployment to fail.
After you are done customizing the runtime parameters, click Redeploy to save.
Click Start to deploy the model with your changes.

Verification

Confirm that the deployed model is shown on the Models tab for the project, and on the Model deployments page of the dashboard with a checkmark in the Status column.
Confirm that the arguments and variables that you set appear in spec.predictor.model.args and spec.predictor.model.env by one of the following methods:
- Checking the InferenceService YAML from the OpenShift Console.
- Using the following command in the OpenShift CLI:
  oc get -o json inferenceservice <inferenceservicename/modelname> -n <projectname>
  Copy to Clipboard Toggle word wrap

2.11.9. Customizable model serving runtime parameters
Copy link

You can modify the parameters of an existing model serving runtime to suit your deployment needs.

For more information about parameters for each of the supported serving runtimes, see the following table:

Expand

Serving runtime	Resource
Caikit Text Generation Inference Server (Caikit-TGIS) ServingRuntime for KServe	Caikit NLP: Configuration TGIS: Model configuration
Caikit Standalone ServingRuntime for KServe	Caikit NLP: Configuration
NVIDIA Triton Inference Server	NVIDIA Triton Inference Server: Model Parameters
OpenVINO Model Server	OpenVINO Model Server Features: Dynamic Input Parameters
Seldon MLServer	MLServer Documentation: Model Settings
[Deprecated] Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe	TGIS: Model configuration
vLLM NVIDIA GPU ServingRuntime for KServe	vLLM: Engine Arguments OpenAI-Compatible Server
vLLM AMD GPU ServingRuntime for KServe	vLLM: Engine Arguments OpenAI-Compatible Server
vLLM Intel Gaudi Accelerator ServingRuntime for KServe	vLLM: Engine Arguments OpenAI-Compatible Server

2.11.10. Using OCI containers for model storage
Copy link

As an alternative to storing a model in an S3 bucket or URI, you can upload models to Open Container Initiative (OCI) containers. Deploying models from OCI containers is also known as modelcars in KServe.

Using OCI containers for model storage can help you:

Reduce startup times by avoiding downloading the same model multiple times.
Reduce disk space usage by reducing the number of models downloaded locally.
Improve model performance by allowing pre-fetched images.

Using OCI containers for model storage involves the following tasks:

Storing a model in an OCI image.
Deploying a model from an OCI image by using either the user interface or the command line interface. To deploy a model by using:
- The user interface, see Deploying models on the single-model serving platform.
- The command line interface, see Deploying a model stored in an OCI image by using the CLI.

2.11.10.1. Storing a model in an OCI image
Copy link

You can store a model in an OCI image. The following procedure uses the example of storing a MobileNet v2-7 model in ONNX format.

Prerequisites

You have a model in the ONNX format. The example in this procedure uses the MobileNet v2-7 model in ONNX format.
You have installed the Podman tool.

Procedure

In a terminal window on your local machine, create a temporary directory for storing both the model and the support files that you need to create the OCI image:
```
cd $(mktemp -d)
```
```
cd $(mktemp -d)
```
Copy to Clipboard Toggle word wrap
Create a models folder inside the temporary directory:
```
mkdir -p models/1
```
```
mkdir -p models/1
```
Copy to Clipboard Toggle word wrap
Note
This example command specifies the subdirectory 1 because OpenVINO requires numbered subdirectories for model versioning. If you are not using OpenVINO, you do not need to create the 1 subdirectory to use OCI container images.

Download the model and support files:

DOWNLOAD_URL=https://github.com/onnx/models/raw/main/validated/vision/classification/mobilenet/model/mobilenetv2-7.onnx
curl -L $DOWNLOAD_URL -O --output-dir models/1/

DOWNLOAD_URL=https://github.com/onnx/models/raw/main/validated/vision/classification/mobilenet/model/mobilenetv2-7.onnx
curl -L $DOWNLOAD_URL -O --output-dir models/1/

Copy to Clipboard

Toggle word wrap

Use the tree command to confirm that the model files are located in the directory structure as expected:
```
tree
```
```
tree
```
Copy to Clipboard Toggle word wrap
The tree command should return a directory structure similar to the following example:
```
.
├── Containerfile
└── models
    └── 1
        └── mobilenetv2-7.onnx
```
```
.
├── Containerfile
└── models
    └── 1
        └── mobilenetv2-7.onnx
```
Copy to Clipboard Toggle word wrap
Create a Docker file named Containerfile:
Note
- Specify a base image that provides a shell. In the following example, ubi9-micro is the base container image. You cannot specify an empty image that does not provide a shell, such as scratch, because KServe uses the shell to ensure the model files are accessible to the model server.
- Change the ownership of the copied model files and grant read permissions to the root group to ensure that the model server can access the files. OpenShift runs containers with a random user ID and the root group ID.
```
FROM registry.access.redhat.com/ubi9/ubi-micro:latest
COPY --chown=0:0 models /models
RUN chmod -R a=rX /models

# nobody user
USER 65534
```
```
FROM registry.access.redhat.com/ubi9/ubi-micro:latest
COPY --chown=0:0 models /models
RUN chmod -R a=rX /models

# nobody user
USER 65534
```
Copy to Clipboard Toggle word wrap
Use podman build commands to create the OCI container image and upload it to a registry. The following commands use Quay as the registry.
Note
If your repository is private, ensure that you are authenticated to the registry before uploading your container image.
```
podman build --format=oci -t quay.io/<user_name>/<repository_name>:<tag_name> .
podman push quay.io/<user_name>/<repository_name>:<tag_name>
```
```
podman build --format=oci -t quay.io/<user_name>/<repository_name>:<tag_name> .
podman push quay.io/<user_name>/<repository_name>:<tag_name>
```
Copy to Clipboard Toggle word wrap

2.11.10.2. Deploying a model stored in an OCI image by using the CLI
Copy link

You can deploy a model that is stored in an OCI image from the command line interface.

The following procedure uses the example of deploying a MobileNet v2-7 model in ONNX format, stored in an OCI image on an OpenVINO model server.

Note

By default in KServe, models are exposed outside the cluster and not protected with authentication.

Prerequisites

You have stored a model in an OCI image as described in Storing a model in an OCI image.
If you want to deploy a model that is stored in a private OCI repository, you must configure an image pull secret. For more information about creating an image pull secret, see Using image pull secrets.
You are logged in to your OpenShift cluster.

Procedure

Create a project to deploy the model:
```
oc new-project oci-model-example
```
```
oc new-project oci-model-example
```
Copy to Clipboard Toggle word wrap
Use the OpenShift AI Applications project kserve-ovms template to create a ServingRuntime resource and configure the OpenVINO model server in the new project:
```
oc process -n redhat-ods-applications -o yaml kserve-ovms | oc apply -f -
```
```
oc process -n redhat-ods-applications -o yaml kserve-ovms | oc apply -f -
```
Copy to Clipboard Toggle word wrap

Verify that the ServingRuntime named kserve-ovms is created:

oc get servingruntimes

oc get servingruntimes

Copy to Clipboard

Toggle word wrap

The command should return output similar to the following:

NAME          DISABLED   MODELTYPE     CONTAINERS         AGE
kserve-ovms              openvino_ir   kserve-container   1m

NAME          DISABLED   MODELTYPE     CONTAINERS         AGE
kserve-ovms              openvino_ir   kserve-container   1m

Copy to Clipboard

Toggle word wrap

Create an InferenceService YAML resource, depending on whether the model is stored from a private or a public OCI repository:

For a model stored in a public OCI repository, create an InferenceService YAML file with the following values, replacing <user_name>, <repository_name>, and <tag_name> with values specific to your environment:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sample-isvc-using-oci
spec:
  predictor:
    model:
      runtime: kserve-ovms # Ensure this matches the name of the ServingRuntime resource
      modelFormat:
        name: onnx
      storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>
      resources:
        requests:
          memory: 500Mi
          cpu: 100m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
        limits:
          memory: 4Gi
          cpu: 500m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sample-isvc-using-oci
spec:
  predictor:
    model:
      runtime: kserve-ovms # Ensure this matches the name of the ServingRuntime resource
      modelFormat:
        name: onnx
      storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>
      resources:
        requests:
          memory: 500Mi
          cpu: 100m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
        limits:
          memory: 4Gi
          cpu: 500m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it

Copy to Clipboard

Toggle word wrap

For a model stored in a private OCI repository, create an InferenceService YAML file that specifies your pull secret in the spec.predictor.imagePullSecrets field, as shown in the following example:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sample-isvc-using-private-oci
spec:
  predictor:
    model:
      runtime: kserve-ovms # Ensure this matches the name of the ServingRuntime resource
      modelFormat:
        name: onnx
      storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>
      resources:
        requests:
          memory: 500Mi
          cpu: 100m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
        limits:
          memory: 4Gi
          cpu: 500m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
    imagePullSecrets: # Specify image pull secrets to use for fetching container images, including OCI model images
    - name: <pull-secret-name>

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sample-isvc-using-private-oci
spec:
  predictor:
    model:
      runtime: kserve-ovms # Ensure this matches the name of the ServingRuntime resource
      modelFormat:
        name: onnx
      storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>
      resources:
        requests:
          memory: 500Mi
          cpu: 100m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
        limits:
          memory: 4Gi
          cpu: 500m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
    imagePullSecrets: # Specify image pull secrets to use for fetching container images, including OCI model images
    - name: <pull-secret-name>

Copy to Clipboard

Toggle word wrap

After you create the InferenceService resource, KServe deploys the model stored in the OCI image referred to by the storageUri field.

Verification

Check the status of the deployment:

oc get inferenceservice

oc get inferenceservice

Copy to Clipboard

Toggle word wrap

The command should return output that includes information, such as the URL of the deployed model and its readiness state.

2.11.11. Using accelerators with vLLM
Copy link

OpenShift AI includes support for NVIDIA, AMD and Intel Gaudi accelerators. OpenShift AI also includes preinstalled model-serving runtimes that provide accelerator support.

2.11.11.1. NVIDIA GPUs
Copy link

You can serve models with NVIDIA graphics processing units (GPUs) by using the vLLM NVIDIA GPU ServingRuntime for KServe runtime. To use the runtime, you must enable GPU support in OpenShift AI. This includes installing and configuring the Node Feature Discovery operator on your cluster. For more information, see Installing the Node Feature Discovery operator and Enabling NVIDIA GPUs.

2.11.11.2. Intel Gaudi accelerators
Copy link

You can serve models with Intel Gaudi accelerators by using the vLLM Intel Gaudi Accelerator ServingRuntime for KServe runtime. To use the runtime, you must enable hybrid processing support (HPU) support in OpenShift AI. This includes installing the Intel Gaudi AI accelerator operator and configuring a hardware profile. For more information, see Setting up Gaudi for OpenShift and Working with hardware profiles.

For information about recommended vLLM parameters, environment variables, supported configurations and more, see vLLM with Intel® Gaudi® AI Accelerators.

Note

Warm-up is a model initialization and performance optimization step that is useful for reducing cold-start delays and first-inference latency. Depending on the model size, warm-up can lead to longer model loading times.

While highly recommended in production environments to avoid performance limitations, you can choose to skip warm-up for non-production environments to reduce model loading times and accelerate model development and testing cycles. To skip warm-up, follow the steps described in Customizing the parameters of a deployed model-serving runtime to add the following environment variable in the Configuration parameters section of your model deployment:

`VLLM_SKIP_WARMUP="true"`

`VLLM_SKIP_WARMUP="true"`

Copy to Clipboard

Toggle word wrap

2.11.11.3. AMD GPUs
Copy link

You can serve models with AMD GPUs by using the vLLM AMD GPU ServingRuntime for KServe runtime. To use the runtime, you must enable support for AMD graphic processing units (GPUs) in OpenShift AI. This includes installing the AMD GPU operator and configuring a hardware profile. For more information, see Deploying the AMD GPU operator on OpenShift in the AMD documentation and Working with hardware profiles.

2.11.12. Customizing the vLLM model-serving runtime
Copy link

In certain cases, you may need to add additional flags or environment variables to the vLLM ServingRuntime for KServe runtime to deploy a family of LLMs.

The following procedure describes customizing the vLLM model-serving runtime to deploy a Llama, Granite or Mistral model.

Prerequisites

You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
For Llama model deployment, you have downloaded a meta-llama-3 model to your object storage.
For Granite model deployment, you have downloaded a granite-7b-instruct or granite-20B-code-instruct model to your object storage.
For Mistral model deployment, you have downloaded a mistral-7B-Instruct-v0.3 model to your object storage.
You have enabled the vLLM ServingRuntime for KServe runtime.
You have enabled GPU support in OpenShift AI and have installed and configured the Node Feature Discovery operator on your cluster. For more information, see Installing the Node Feature Discovery operator and Enabling NVIDIA GPUs

Procedure

Follow the steps to deploy a model as described in Deploying models on the single-model serving platform.
In the Serving runtime field, select vLLM ServingRuntime for KServe.
If you are deploying a meta-llama-3 model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
–-distributed-executor-backend=mp 
--max-model-len=6144 
```
```
–-distributed-executor-backend=mp 
```
1
```
--max-model-len=6144 
```
2
Copy to Clipboard Toggle word wrap
1
Sets the backend to multiprocessing for distributed model workers
2
Sets the maximum context length of the model to 6144 tokens
If you are deploying a granite-7B-instruct model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
--distributed-executor-backend=mp 
```
```
--distributed-executor-backend=mp 
```
1
Copy to Clipboard Toggle word wrap
1
Sets the backend to multiprocessing for distributed model workers
If you are deploying a granite-20B-code-instruct model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
--distributed-executor-backend=mp 
–-tensor-parallel-size=4 
--max-model-len=6448 
```
```
--distributed-executor-backend=mp 
```
1
```
–-tensor-parallel-size=4 
```
2
```
--max-model-len=6448 
```
3
Copy to Clipboard Toggle word wrap
1
Sets the backend to multiprocessing for distributed model workers
2
Distributes inference across 4 GPUs in a single node
3
Sets the maximum context length of the model to 6448 tokens
If you are deploying a mistral-7B-Instruct-v0.3 model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
--distributed-executor-backend=mp 
--max-model-len=15344 
```
```
--distributed-executor-backend=mp 
```
1
```
--max-model-len=15344 
```
2
Copy to Clipboard Toggle word wrap
1
Sets the backend to multiprocessing for distributed model workers
2
Sets the maximum context length of the model to 15344 tokens
Click Deploy.

Verification

Confirm that the deployed model is shown on the Models tab for the project, and on the Model deployments page of the dashboard with a checkmark in the Status column.

For granite models, use the following example command to verify API requests to your deployed model:

curl -q -X 'POST' \
    "https://<inference_endpoint_url>:443/v1/chat/completions" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d "{
    \"model\": \"<model_name>\",
    \"prompt\": \"<prompt>",
    \"max_tokens\": <max_tokens>,
    \"temperature\": <temperature>
    }"

curl -q -X 'POST' \
    "https://<inference_endpoint_url>:443/v1/chat/completions" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d "{
    \"model\": \"<model_name>\",
    \"prompt\": \"<prompt>",
    \"max_tokens\": <max_tokens>,
    \"temperature\": <temperature>
    }"

Copy to Clipboard

Toggle word wrap

2.12. Making inference requests to models deployed on the single-model serving platform
Copy link

When you deploy a model by using the single-model serving platform, the model is available as a service that you can access using API requests. This enables you to return predictions based on data inputs. To use API requests to interact with your deployed model, you must know the inference endpoint for the model.

In addition, if you secured your inference endpoint by enabling token authentication, you must know how to access your authentication token so that you can specify this in your inference requests.

2.12.1. Accessing the authentication token for a deployed model
Copy link

If you secured your model inference endpoint by enabling token authentication, you must know how to access your authentication token so that you can specify it in your inference requests.

Prerequisites

You have logged in to Red Hat OpenShift AI.
You have deployed a model by using the single-model serving platform.

Procedure

From the OpenShift AI dashboard, click Data science projects.
The Data science projects page opens.
Click the name of the project that contains your deployed model.
A project details page opens.
Click the Models tab.
In the Models and model servers list, expand the section for your model.
Your authentication token is shown in the Token authentication section, in the Token secret field.
Optional: To copy the authentication token for use in an inference request, click the Copy button ( ) next to the token value.

2.12.2. Accessing the inference endpoint for a deployed model
Copy link

To make inference requests to your deployed model, you must know how to access the inference endpoint that is available.

For a list of paths to use with the supported runtimes and example commands, see Inference endpoints.

Prerequisites

You have logged in to Red Hat OpenShift AI.
You have deployed a model by using the single-model serving platform.
If you enabled token authentication for your deployed model, you have the associated token value.

Procedure

From the OpenShift AI dashboard, click Models Model deployments.
The inference endpoint for the model is shown in the Inference endpoint field.
Depending on what action you want to perform with the model (and if the model supports that action), copy the inference endpoint and then add a path to the end of the URL.
Use the endpoint to make API requests to your deployed model.

2.13. Viewing model-serving runtime metrics for the single-model serving platform
Copy link

When a cluster administrator has configured monitoring for the single-model serving platform, non-admin users can use the OpenShift web console to view model-serving runtime metrics for the KServe component.

Prerequisites

You have access to the OpenShift cluster as a developer or as a user with view permissions for the project that you are viewing metrics for.
You are familiar with querying metrics in user-defined projects. See Monitoring project and application metrics using the Developer perspective in Red Hat OpenShift Dedicated or Monitoring project and application metrics using the Developer perspective in Red Hat OpenShift Service on AWS.

Procedure

Log in to the OpenShift web console.
Switch to the Developer perspective.
In the left menu, click Observe.
As described in Monitoring your project metrics in Red Hat OpenShift Dedicated or Monitoring your project metrics in Red Hat OpenShift Service on AWS, use the web console to run queries for caikit_*, tgi_*, ovms_* and vllm:* model-serving runtime metrics. You can also run queries for istio_* metrics that are related to OpenShift Service Mesh. Some examples are shown.
1. The following query displays the number of successful inference requests over a period of time for a model deployed with the vLLM runtime:
  sum(increase(vllm:request_success_total{namespace=${namespace},model_name=${model_name}}[${rate_interval}]))
  Copy to Clipboard Toggle word wrap
2. The following query displays the number of successful inference requests over a period of time for a model deployed with the standalone TGIS runtime:
  sum(increase(tgi_request_success{namespace=${namespace}, pod=~${model_name}-predictor-.*}[${rate_interval}]))
  Copy to Clipboard Toggle word wrap
3. The following query displays the number of successful inference requests over a period of time for a model deployed with the Caikit Standalone runtime:
  sum(increase(predict_rpc_count_total{namespace=${namespace},code=OK,model_id=${model_name}}[${rate_interval}]))
  Copy to Clipboard Toggle word wrap
4. The following query displays the number of successful inference requests over a period of time for a model deployed with the OpenVINO Model Server runtime:
  sum(increase(ovms_requests_success{namespace=${namespace},name=${model_name}}[${rate_interval}]))
  Copy to Clipboard Toggle word wrap

2.14. Monitoring model performance
Copy link

In the single-model serving platform, you can view performance metrics for a specific model that is deployed on the platform.

2.14.1. Viewing performance metrics for a deployed model
Copy link

You can monitor the following metrics for a specific model that is deployed on the single-model serving platform:

Number of requests - The number of requests that have failed or succeeded for a specific model.
Average response time (ms) - The average time it takes a specific model to respond to requests.
CPU utilization (%) - The percentage of the CPU limit per model replica that is currently utilized by a specific model.
Memory utilization (%) - The percentage of the memory limit per model replica that is utilized by a specific model.

You can specify a time range and a refresh interval for these metrics to help you determine, for example, when the peak usage hours are and how the model is performing at a specified time.

Prerequisites

You have installed Red Hat OpenShift AI.
You have logged in to Red Hat OpenShift AI.
The following dashboard configuration options are set to the default values as shown:
```
disablePerformanceMetrics:false
disableKServeMetrics:false
```
```
disablePerformanceMetrics:false
disableKServeMetrics:false
```
Copy to Clipboard Toggle word wrap
For more information about setting dashboard configuration options, see Customizing the dashboard.
You have deployed a model on the single-model serving platform by using a preinstalled runtime.
Note
Metrics are only supported for models deployed by using a preinstalled model-serving runtime or a custom runtime that is duplicated from a preinstalled runtime.

Procedure

From the OpenShift AI dashboard navigation menu, click Data science projects.
The Data science projects page opens.
Click the name of the project that contains the data science models that you want to monitor.
In the project details page, click the Models tab.
Select the model that you are interested in.
On the Endpoint performance tab, set the following options:
- Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
- Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.
Scroll down to view data graphs for number of requests, average response time, CPU utilization, and memory utilization.

Verification

The Endpoint performance tab shows graphs of metrics for the model.

2.14.2. Deploying a Grafana metrics dashboard
Copy link

You can deploy a Grafana metrics dashboard for User Workload Monitoring (UWM) to monitor performance and resource usage metrics for models deployed on the single-model serving platform.

You can create a Kustomize overlay, similar to this example. Use the overlay to deploy preconfigured metrics dashboards for models deployed with OpenVino Model Server (OVMS) and vLLM.

Prerequisites

You have cluster admin privileges for your OpenShift cluster.
You have installed the OpenShift command-line interface (CLI). For more information, see Installing the OpenShift CLI (OpenShift Dedicated) or Installing the OpenShift CLI (Red Hat OpenShift Service on AWS).
You have created an overlay to deploy a Grafana instance, similar to this example.
Note
To view GPU metrics, you must enable the NVIDIA GPU monitoring dashboard as described in Enabling the GPU monitoring dashboard. The GPU monitoring dashboard provides a comprehensive view of GPU utilization, memory usage, and other metrics for your GPU nodes.

Procedure

In a terminal window, log in to the OpenShift CLI as a cluster administrator.
If you have not already created the overlay to install the Grafana operator and metrics dashboards, refer to the RHOAI UWM repository to create it.
Install the Grafana instance and metrics dashboards on your OpenShift cluster with the overlay that you created. Replace <overlay-name> with the name of your overlay.
```
oc apply -k overlays/<overlay-name>
```
```
oc apply -k overlays/<overlay-name>
```
Copy to Clipboard Toggle word wrap
Retrieve the URL of the Grafana instance. Replace <namespace> with the namespace that contains the Grafana instance.
```
oc get route -n <namespace> grafana-route -o jsonpath='{.spec.host}'
```
```
oc get route -n <namespace> grafana-route -o jsonpath='{.spec.host}'
```
Copy to Clipboard Toggle word wrap
Use the URL to access the Grafana instance:
```
grafana-<namespace>.apps.example-openshift.com
```
```
grafana-<namespace>.apps.example-openshift.com
```
Copy to Clipboard Toggle word wrap

Verification

You can access the preconfigured dashboards available for KServe, vLLM and OVMS on the Grafana instance.

2.14.3. Deploying a vLLM/GPU metrics dashboard on a Grafana instance
Copy link

Deploy Grafana boards to monitor accelerator and vLLM performance metrics.

Prerequisites

You have deployed a Grafana metrics dashboard, as described in Deploying a Grafana metrics dashboard.
You can access a Grafana instance.
You have installed envsubst, a command-line tool used to substitute environment variables in configuration files. For more information, see the GNU gettext documentation.

Procedure

Define a GrafanaDashboard object in a YAML file, similar to the following examples:
1. To monitor NVIDIA accelerator metrics, see nvidia-vllm-dashboard.yaml.
2. To monitor AMD accelerator metrics, see amd-vllm-dashboard.yaml.
3. To monitor Intel accelerator metrics, see gaudi-vllm-dashboard.yaml.
4. To monitor vLLM metrics, see grafana-vllm-dashboard.yaml.
Create an inputs.env file similar to the following example. Replace the NAMESPACE and MODEL_NAME parameters with your own values:
```
NAMESPACE=<namespace> 
MODEL_NAME=<model-name> 
```
```
NAMESPACE=<namespace> 
```
1
```
MODEL_NAME=<model-name> 
```
2
Copy to Clipboard Toggle word wrap
1
NAMESPACE is the target namespace where the model will be deployed.
2
MODEL_NAME is the model name as defined in your InferenceService. The model name is also used to filter the pod name in the Grafana dashboard.
Replace the NAMESPACE and MODEL_NAME parameters in your YAML file with the values from the inputs.env file by performing the following actions:
1. Export the parameters described in the inputs.env as environment variables:
  export $(cat inputs.env | xargs)
  Copy to Clipboard Toggle word wrap
2. Update the following YAML file, replacing the ${NAMESPACE} and ${MODEL_NAME} variables with the values of the exported environment variables, and dashboard_template.yaml with the name of the GrafanaDashboard object YAML file that you created earlier:
  envsubst '${NAMESPACE} ${MODEL_NAME}' < dashboard_template.yaml > dashboard_template-replaced.yaml
  Copy to Clipboard Toggle word wrap
Confirm that your YAML file contains updated values.

Deploy the dashboard object:

oc create -f dashboard_template-replaced.yaml

oc create -f dashboard_template-replaced.yaml

Copy to Clipboard

Toggle word wrap

Verification

You can see the accelerator and vLLM metrics dashboard on your Grafana instance.

2.14.4. Grafana metrics
Copy link

You can use Grafana boards to monitor the accelerator and vLLM performance metrics. The datasource, instance and gpu are variables defined inside the board.

2.14.4.1. Accelerator metrics
Copy link

Track metrics on your accelerators to ensure the health of the hardware.

NVIDIA GPU utilization

Tracks the percentage of time the GPU is actively processing tasks, indicating GPU workload levels.

Query

DCGM_FI_DEV_GPU_UTIL{instance=~"$instance", gpu=~"$gpu"}

DCGM_FI_DEV_GPU_UTIL{instance=~"$instance", gpu=~"$gpu"}

Copy to Clipboard

Toggle word wrap

NVIDIA GPU memory utilization

Compares memory usage against free memory, which is critical for identifying memory bottlenecks in GPU-heavy workloads.

Query

DCGM_FI_DEV_POWER_USAGE{instance=~"$instance", gpu=~"$gpu"}

DCGM_FI_DEV_POWER_USAGE{instance=~"$instance", gpu=~"$gpu"}

Copy to Clipboard

Toggle word wrap

Sum

sum(DCGM_FI_DEV_POWER_USAGE{instance=~"$instance", gpu=~"$gpu"})

sum(DCGM_FI_DEV_POWER_USAGE{instance=~"$instance", gpu=~"$gpu"})

Copy to Clipboard

Toggle word wrap

NVIDIA GPU temperature

Ensures the GPU operates within safe thermal limits to prevent hardware degradation.

Query

DCGM_FI_DEV_GPU_TEMP{instance=~"$instance", gpu=~"$gpu"}

DCGM_FI_DEV_GPU_TEMP{instance=~"$instance", gpu=~"$gpu"}

Copy to Clipboard

Toggle word wrap

Avg

avg(DCGM_FI_DEV_GPU_TEMP{instance=~"$instance", gpu=~"$gpu"})

avg(DCGM_FI_DEV_GPU_TEMP{instance=~"$instance", gpu=~"$gpu"})

Copy to Clipboard

Toggle word wrap

NVIDIA GPU throttling

GPU throttling occurs when the GPU automatically reduces the clock to avoid damage from overheating.

You can access the following metrics to identify GPU throttling:

GPU temperature: Monitor the GPU temperature. Throttling often occurs when the GPU reaches a certain temperature, for example, 85-90°C.
SM clock speed: Monitor the core clock speed. A significant drop in the clock speed while the GPU is under load indicates throttling.

2.14.4.2. CPU metrics
Copy link

You can track metrics on your CPU to ensure the health of the hardware.

CPU utilization

Tracks CPU usage to identify workloads that are CPU-bound.

Query

sum(rate(container_cpu_usage_seconds_total{namespace="$namespace", pod=~"$model_name.*"}[5m])) by (namespace)

sum(rate(container_cpu_usage_seconds_total{namespace="$namespace", pod=~"$model_name.*"}[5m])) by (namespace)

Copy to Clipboard

Toggle word wrap

CPU-GPU bottlenecks

A combination of CPU throttling and GPU usage metrics to identify resource allocation inefficiencies. The following table outlines the combination of CPU throttling and GPU utilizations, and what these metrics mean for your environment:

Expand

CPU throttling	GPU utilization	Meaning
Low	High	System well-balanced. GPU is fully used without CPU constraints.
High	Low	CPU resources are constrained. The CPU is unable to keep up with the GPU’s processing demands, and the GPU may be underused.
High	High	Workload is increasing for both CPU and GPU, and you might need to scale up resources.

Query

sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace", pod=~"$model_name.*"}[5m])) by (namespace)
avg_over_time(DCGM_FI_DEV_GPU_UTIL{instance=~"$instance", gpu=~"$gpu"}[5m])

sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace", pod=~"$model_name.*"}[5m])) by (namespace)
avg_over_time(DCGM_FI_DEV_GPU_UTIL{instance=~"$instance", gpu=~"$gpu"}[5m])

Copy to Clipboard

Toggle word wrap

2.14.4.3. vLLM metrics
Copy link

You can track metrics related to your vLLM model.

GPU and CPU cache utilization

Tracks the percentage of GPU memory used by the vLLM model, providing insights into memory efficiency.

Query

sum_over_time(vllm:gpu_cache_usage_perc{namespace="${namespace}",pod=~"$model_name.*"}[24h])

sum_over_time(vllm:gpu_cache_usage_perc{namespace="${namespace}",pod=~"$model_name.*"}[24h])

Copy to Clipboard

Toggle word wrap

Running requests

The number of requests actively being processed. Helps monitor workload concurrency.

num_requests_running{namespace="$namespace", pod=~"$model_name.*"}

num_requests_running{namespace="$namespace", pod=~"$model_name.*"}

Copy to Clipboard

Toggle word wrap

Waiting requests

Tracks requests in the queue, indicating system saturation.

num_requests_waiting{namespace="$namespace", pod=~"$model_name.*"}

num_requests_waiting{namespace="$namespace", pod=~"$model_name.*"}

Copy to Clipboard

Toggle word wrap

Prefix cache hit rates

High hit rates imply efficient reuse of cached computations, optimizing resource usage.

Queries

vllm:gpu_cache_usage_perc{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}
vllm:cpu_cache_usage_perc{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}

vllm:gpu_cache_usage_perc{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}
vllm:cpu_cache_usage_perc{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}

Copy to Clipboard

Toggle word wrap

Request total count

Query

vllm:request_success_total{finished_reason="length",namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}

vllm:request_success_total{finished_reason="length",namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}

Copy to Clipboard

Toggle word wrap

The request ended because it reached the maximum token limit set for the model inference.

Query

vllm:request_success_total{finished_reason="stop",namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}

vllm:request_success_total{finished_reason="stop",namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}

Copy to Clipboard

Toggle word wrap

The request completed naturally based on the model’s output or a stop condition, for example, the end of a sentence or token completion.

End-to-end latency: Measures the overall time to process a request for an optimal user experience.

Histogram queries

histogram_quantile(0.99, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.95, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.9, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.5, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
rate(vllm:e2e_request_latency_seconds_sum{namespace="$namespace", pod=~"$model_name.*",model_name="$model_name"}[5m])
rate(vllm:e2e_request_latency_seconds_count{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])

histogram_quantile(0.99, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.95, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.9, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.5, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
rate(vllm:e2e_request_latency_seconds_sum{namespace="$namespace", pod=~"$model_name.*",model_name="$model_name"}[5m])
rate(vllm:e2e_request_latency_seconds_count{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])

Copy to Clipboard

Toggle word wrap

Time to first token (TTFT) latency

The time taken to generate the first token in a response.

Histogram queries

histogram_quantile(0.99, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.95, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.9, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.5, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
rate(vllm:time_to_first_token_seconds_sum{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
rate(vllm:time_to_first_token_seconds_count{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])

histogram_quantile(0.99, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.95, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.9, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.5, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
rate(vllm:time_to_first_token_seconds_sum{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
rate(vllm:time_to_first_token_seconds_count{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])

Copy to Clipboard

Toggle word wrap

Time per output token (TPOT) latency

The average time taken to generate each output token.

Histogram queries

histogram_quantile(0.99, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.95, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.9, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.5, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
rate(vllm:time_per_output_token_seconds_sum{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
rate(vllm:time_per_output_token_seconds_count{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])

histogram_quantile(0.99, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.95, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.9, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.5, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
rate(vllm:time_per_output_token_seconds_sum{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
rate(vllm:time_per_output_token_seconds_count{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])

Copy to Clipboard

Toggle word wrap

Prompt token throughput and generation throughput

Tracks the speed of processing prompt tokens for LLM optimization.

Queries

rate(vllm:prompt_tokens_total{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
rate(vllm:generation_tokens_total{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])

rate(vllm:prompt_tokens_total{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
rate(vllm:generation_tokens_total{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])

Copy to Clipboard

Toggle word wrap

Total tokens generated: Measures the efficiency of generating response tokens, critical for real-time applications.

Query

sum(vllm:generation_tokens_total{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"})

sum(vllm:generation_tokens_total{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"})

Copy to Clipboard

Toggle word wrap

2.14.5. Configuring metrics-based autoscaling
Copy link

Important

Metrics-based autoscaling is currently available in Red Hat OpenShift AI as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Knative-based autoscaling is not available in standard deployment mode. However, you can enable metrics-based autoscaling for an inference service in standard deployment mode. Metrics-based autoscaling helps you efficiently manage accelerator resources, lower operational costs, and ensure that your inference services meet performance requirements.

To set up autoscaling for your inference service in standard deployments, install and configure the OpenShift Custom Metrics Autoscaler (CMA), which is based on Kubernetes Event-driven Autoscaling (KEDA). You can then use various model runtime metrics available in OpenShift Monitoring to trigger autoscaling of your inference service, such as KVCache utilization, Time to First Token (TTFT), and Concurrency.

Prerequisites

You have cluster administrator privileges for your OpenShift cluster.
You have installed the CMA operator on your cluster. For more information, see Installing the custom metrics autoscaler.
Note
- You must configure the KedaController resource after installing the CMA operator.
- The odh-controller automatically creates the TriggerAuthentication, ServiceAccount, Role, RoleBinding, and Secret resources to allow CMA access to OpenShift Monitoring metrics.
You have enabled User Workload Monitoring (UWM) for your cluster. For more information, see Configuring user workload monitoring.
You have deployed a model on the single-model serving platform in standard deployment mode.

Procedure

Log in to the OpenShift console as a cluster administrator.
In the Administrator perspective, click Home Search.
Select the project where you have deployed your model.
From the Resources dropdown menu, select InferenceService.
Click the InferenceService for your deployed model and then click YAML.

Under spec.predictor, define a metric-based autoscaling policy similar to the following example:

kind: InferenceService
metadata:
  name: my-inference-service
  namespace: my-namespace
  annotations:
    serving.kserve.io/autoscalerClass: keda
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    autoscaling:
      metrics:
        - type: External
          external:
            metric:
              backend: "prometheus"
              serverAddress: "https://thanos-querier.openshift-monitoring.svc:9092"
              query: vllm:num_requests_waiting
          authenticationRef:
            name: inference-prometheus-auth
          authModes: bearer
          target:
            type: Value
            value: 2

kind: InferenceService
metadata:
  name: my-inference-service
  namespace: my-namespace
  annotations:
    serving.kserve.io/autoscalerClass: keda
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    autoscaling:
      metrics:
        - type: External
          external:
            metric:
              backend: "prometheus"
              serverAddress: "https://thanos-querier.openshift-monitoring.svc:9092"
              query: vllm:num_requests_waiting
          authenticationRef:
            name: inference-prometheus-auth
          authModes: bearer
          target:
            type: Value
            value: 2

Copy to Clipboard

Toggle word wrap

The example configuration sets up the inference service to autoscale between 1 and 5 replicas based on the number of requests waiting to be processed, as indicated by the vllm:num_requests_waiting metric.

Click Save.

Verification

Confirm that the KEDA ScaledObject resource is created:
```
oc get scaledobject -n <namespace>
```
```
oc get scaledobject -n <namespace>
```
Copy to Clipboard Toggle word wrap

2.15. Optimizing model-serving runtimes
Copy link

You can optionally enhance the preinstalled model-serving runtimes available in OpenShift AI to leverage additional benefits and capabilities, such as optimized inferencing, reduced latency, and fine-tuned resource allocation.

2.15.1. Enabling speculative decoding and multi-modal inferencing
Copy link

You can configure the vLLM NVIDIA GPU ServingRuntime for KServe runtime to use speculative decoding, a parallel processing technique to optimize inferencing time for large language models (LLMs).

You can also configure the runtime to support inferencing for vision-language models (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.

The following procedure describes customizing the vLLM NVIDIA GPU ServingRuntime for KServe runtime for speculative decoding and multi-modal inferencing.

Prerequisites

You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
If you are using the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.

Procedure

Follow the steps to deploy a model as described in Deploying models on the single-model serving platform.
In the Serving runtime field, select the vLLM NVIDIA GPU ServingRuntime for KServe runtime.
To configure the vLLM model-serving runtime for speculative decoding by matching n-grams in the prompt, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
--speculative-model=[ngram]
--num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
--ngram-prompt-lookup-max=<NGRAM_PROMPT_LOOKUP_MAX>
--use-v2-block-manager
```
```
--speculative-model=[ngram]
--num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
--ngram-prompt-lookup-max=<NGRAM_PROMPT_LOOKUP_MAX>
--use-v2-block-manager
```
Copy to Clipboard Toggle word wrap
1. Replace <NUM_SPECULATIVE_TOKENS> and <NGRAM_PROMPT_LOOKUP_MAX> with your own values.
  Note
  Inferencing throughput varies depending on the model used for speculating with n-grams.

To configure the vLLM model-serving runtime for speculative decoding with a draft model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:

--port=8080
--served-model-name={{.Name}}
--distributed-executor-backend=mp
--model=/mnt/models/<path_to_original_model>
--speculative-model=/mnt/models/<path_to_speculative_model>
--num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
--use-v2-block-manager

--port=8080
--served-model-name={{.Name}}
--distributed-executor-backend=mp
--model=/mnt/models/<path_to_original_model>
--speculative-model=/mnt/models/<path_to_speculative_model>
--num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
--use-v2-block-manager

Copy to Clipboard

Toggle word wrap

Replace <path_to_speculative_model> and <path_to_original_model> with the paths to the speculative model and original model on your S3-compatible object storage.
Replace <NUM_SPECULATIVE_TOKENS> with your own value.

To configure the vLLM model-serving runtime for multi-modal inferencing, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
--trust-remote-code
```
```
--trust-remote-code
```
Copy to Clipboard Toggle word wrap
Note
Only use the --trust-remote-code argument with models from trusted sources.
Click Deploy.

Verification

If you have configured the vLLM model-serving runtime for speculative decoding, use the following example command to verify API requests to your deployed model:

curl -v https://<inference_endpoint_url>:443/v1/chat/completions
-H "Content-Type: application/json"
-H "Authorization: Bearer <token>"

curl -v https://<inference_endpoint_url>:443/v1/chat/completions
-H "Content-Type: application/json"
-H "Authorization: Bearer <token>"

Copy to Clipboard

Toggle word wrap

If you have configured the vLLM model-serving runtime for multi-modal inferencing, use the following example command to verify API requests to the vision-language model (VLM) that you have deployed:

curl -v https://<inference_endpoint_url>:443/v1/chat/completions
-H "Content-Type: application/json"
-H "Authorization: Bearer <token>"
-d '{"model":"<model_name>",
     "messages":
        [{"role":"<role>",
          "content":
             [{"type":"text", "text":"<text>"
              },
              {"type":"image_url", "image_url":"<image_url_link>"
              }
             ]
         }
        ]
    }'

curl -v https://<inference_endpoint_url>:443/v1/chat/completions
-H "Content-Type: application/json"
-H "Authorization: Bearer <token>"
-d '{"model":"<model_name>",
     "messages":
        [{"role":"<role>",
          "content":
             [{"type":"text", "text":"<text>"
              },
              {"type":"image_url", "image_url":"<image_url_link>"
              }
             ]
         }
        ]
    }'

Copy to Clipboard

Toggle word wrap

2.16. Performance optimization and tuning
Copy link

2.16.1. Determining GPU requirements for LLM-powered applications
Copy link

There are several factors to consider when choosing GPUs for applications powered by a Large Language Model (LLM) hosted on OpenShift AI.

The following guidelines help you determine the hardware requirements for your application, depending on the size and expected usage of your model.

Estimating memory needs: A general rule of thumb is that a model with N parameters in 16-bit precision requires approximately 2N bytes of GPU memory. For example, an 8-billion-parameter model requires around 16GB of GPU memory, while a 70-billion-parameter model requires around 140GB.
Quantization: To reduce memory requirements and potentially improve throughput, you can use quantization to load or run the model at lower-precision formats such as INT8, FP8, or INT4. This reduces the memory footprint at the expense of a slight reduction in model accuracy.
Note
The vLLM ServingRuntime for KServe model-serving runtime supports several quantization methods. For more information about supported implementations and compatible hardware, see Supported hardware for quantization kernels.
Additional memory for key-value cache: In addition to model weights, GPU memory is also needed to store the attention key-value (KV) cache, which increases with the number of requests and the sequence length of each request. This can impact performance in real-time applications, especially for larger models.
Recommended GPU configurations:
- Small Models (1B–8B parameters): For models in the range, a GPU with 24GB of memory is generally sufficient to support a small number of concurrent users.
- Medium Models (10B–34B parameters):
  - Models under 20B parameters require at least 48GB of GPU memory.
  - Models that are between 20B - 34B parameters require at least 80GB or more of memory in a single GPU.
- Large Models (70B parameters): Models in this range may need to be distributed across multiple GPUs by using tensor parallelism techniques. Tensor parallelism allows the model to span multiple GPUs, improving inter-token latency and increasing the maximum batch size by freeing up additional memory for KV cache. Tensor parallelism works best when GPUs have fast interconnects such as an NVLink.
- Very Large Models (405B parameters): For extremely large models, quantization is recommended to reduce memory demands. You can also distribute the model using pipeline parallelism across multiple GPUs, or even across two servers. This approach allows you to scale beyond the memory limitations of a single server, but requires careful management of inter-server communication for optimal performance.

For best results, start with smaller models and then scale up to larger models as required, using techniques such as parallelism and quantization to meet your performance and memory requirements.

2.16.2. Performance considerations for text-summarization and retrieval-augmented generation (RAG) applications
Copy link

There are additional factors that need to be taken into consideration for text-summarization and RAG applications, as well as for LLM-powered services that process large documents uploaded by users.

Longer Input Sequences: The input sequence length can be significantly longer than in a typical chat application, if each user query includes a large prompt or a large amount of context such as an uploaded document. The longer input sequence length increases the prefill time, the time the model takes to process the initial input sequence before generating a response, which can then lead to a higher Time-to-First-Token (TTFT). A longer TTFT may impact the responsiveness of the application. Minimize this latency for optimal user experience.
KV Cache Usage: Longer sequences require more GPU memory for the key-value (KV) cache. The KV cache stores intermediate attention data to improve model performance during generation. A high KV cache utilization per request requires a hardware setup with sufficient GPU memory. This is particularly crucial if multiple users are querying the model concurrently, as each request adds to the total memory load.
Optimal Hardware Configuration: To maintain responsiveness and avoid memory bottlenecks, select a GPU configuration with sufficient memory. For instance, instead of running an 8B model on a single 24GB GPU, deploying it on a larger GPU (e.g., 48GB or 80GB) or across multiple GPUs can improve performance by providing more memory headroom for the KV cache and reducing inter-token latency. Multi-GPU setups with tensor parallelism can also help manage memory demands and improve efficiency for larger input sequences.

In summary, to ensure optimal responsiveness and scalability for document-based applications, you must prioritize hardware with high GPU memory capacity and also consider multi-GPU configurations to handle the increased memory requirements of long input sequences and KV caching.

2.16.3. Inference performance metrics
Copy link

Latency, throughput and cost per million tokens are key metrics to consider when evaluating the response generation efficiency of a model during inferencing. These metrics provide a comprehensive view of a model’s inference performance and can help balance speed, efficiency, and cost for different use cases.

2.16.3.1. Latency
Copy link

Latency is critical for interactive or real-time use cases, and is measured using the following metrics:

Time-to-First-Token (TTFT): The delay in milliseconds between the initial request and the generation of the first token. This metric is important for streaming responses.
Inter-Token Latency (ITL): The time taken in milliseconds to generate each subsequent token after the first, also relevant for streaming.
Time-Per-Output-Token (TPOT): For non-streaming requests, the average time taken in milliseconds to generate each token in an output sequence.

2.16.3.2. Throughput
Copy link

Throughput measures the overall efficiency of a model server and is expressed with the following metrics:

Tokens per Second (TPS): The total number of tokens generated per second across all active requests.
Requests per Second (RPS): The number of requests processed per second. RPS, like response time, is sensitive to sequence length.

2.16.3.3. Cost per million tokens
Copy link

Cost per Million Tokens measures the cost-effectiveness of a model’s inference, indicating the expense incurred per million tokens generated. This metric helps to assess both the economic feasibility and scalability of deploying the model.

2.16.4. Resolving CUDA out-of-memory errors
Copy link

In certain cases, depending on the model and hardware accelerator used, the TGIS memory auto-tuning algorithm might underestimate the amount of GPU memory needed to process long sequences. This miscalculation can lead to Compute Unified Architecture (CUDA) out-of-memory (OOM) error responses from the model server. In such cases, you must update or add additional parameters in the TGIS model-serving runtime, as described in the following procedure.

Note

The Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe is deprecated. For more information, see OpenShift AI release notes.

Prerequisites

You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.

Procedure

From the OpenShift AI dashboard, click Settings Serving runtimes.
The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
Based on the runtime that you used to deploy your model, perform one of the following actions:
- If you used the preinstalled Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe runtime, duplicate the runtime to create a custom version and then follow the remainder of this procedure. For more information about duplicating the pre-installed TGIS runtime, see Adding a custom model-serving runtime for the single-model serving platform.
- If you were already using a custom TGIS runtime, click the action menu (⋮) next to the runtime and select Edit.
  The embedded YAML editor opens and shows the contents of the custom model-serving runtime.
Add or update the BATCH_SAFETY_MARGIN environment variable and set the value to 30. Similarly, add or update the ESTIMATE_MEMORY_BATCH_SIZE environment variable and set the value to 8.
```
spec:
  containers:
    env:
    - name: BATCH_SAFETY_MARGIN
      value: 30
    - name: ESTIMATE_MEMORY_BATCH
      value: 8
```
```
spec:
  containers:
    env:
    - name: BATCH_SAFETY_MARGIN
      value: 30
    - name: ESTIMATE_MEMORY_BATCH
      value: 8
```
Copy to Clipboard Toggle word wrap
Note
The BATCH_SAFETY_MARGIN parameter sets a percentage of free GPU memory to hold back as a safety margin to avoid OOM conditions. The default value of BATCH_SAFETY_MARGIN is 20. The ESTIMATE_MEMORY_BATCH_SIZE parameter sets the batch size used in the memory auto-tuning algorithm. The default value of ESTIMATE_MEMORY_BATCH_SIZE is 16.
Click Update.
The Serving runtimes page opens and shows the list of runtimes that are installed. Observe that the custom model-serving runtime you updated is shown.
To redeploy the model for the parameter updates to take effect, perform the following actions:
1. From the OpenShift AI dashboard, click Models Model deployments.
2. Find the model you want to redeploy, click the action menu (⋮) next to the model, and select Delete.
3. Redeploy the model as described in Deploying models on the single-model serving platform.

Verification

You receive successful responses from the model server and no longer see CUDA OOM errors.

2.17. About the NVIDIA NIM model serving platform
Copy link

You can deploy models using NVIDIA NIM inference services on the NVIDIA NIM model serving platform.

NVIDIA NIM, part of NVIDIA AI Enterprise, is a set of microservices designed for secure, reliable deployment of high performance AI model inferencing across clouds, data centers and workstations.

2.17.1. Enabling the NVIDIA NIM model serving platform
Copy link

As an OpenShift AI administrator, you can use the Red Hat OpenShift AI dashboard to enable the NVIDIA NIM model serving platform.

Prerequisites

You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
You have enabled the single-model serving platform. You do not need to enable a preinstalled runtime. For more information about enabling the single-model serving platform, see Enabling the single-model serving platform.
The disableNIMModelServing dashboard configuration option is set to false.
For more information about setting dashboard configuration options, see Customizing the dashboard.
You have enabled GPU support in OpenShift AI. This includes installing the Node Feature Discovery operator and NVIDIA GPU Operators. For more information, see Installing the Node Feature Discovery operator and Enabling NVIDIA GPUs.
You have an NVIDIA Cloud Account (NCA) and can access the NVIDIA GPU Cloud (NGC) portal. For more information, see NVIDIA GPU Cloud user guide.
Your NCA account is associated with the NVIDIA AI Enterprise Viewer role.
You have generated a personal API key on the NGC portal. For more information, see Generating a Personal API Key.

Procedure

In the left menu of the OpenShift AI dashboard, click Applications Explore.
On the Explore page, find the NVIDIA NIM tile.
Click Enable on the application tile.
Enter your personal API key and then click Submit.

Verification

The NVIDIA NIM application that you enabled appears on the Enabled page.

2.17.2. Deploying models on the NVIDIA NIM model serving platform
Copy link

When you have enabled the NVIDIA NIM model serving platform, you can start to deploy NVIDIA-optimized models on the platform.

Prerequisites

You have logged in to Red Hat OpenShift AI.
You have enabled the NVIDIA NIM model serving platform.
You have created a data science project.
You have enabled support for graphic processing units (GPUs) in OpenShift AI. This includes installing the Node Feature Discovery operator and NVIDIA GPU Operators. For more information, see Installing the Node Feature Discovery operator and Enabling NVIDIA GPUs.

Procedure

In the left menu, click Data science projects.
The Data science projects page opens.
Click the name of the project that you want to deploy a model in.
A project details page opens.
Click the Models tab.
In the Models section, perform one of the following actions:
- On the NVIDIA NIM model serving platform tile, click Select NVIDIA NIM on the tile, and then click Deploy model.
- If you have previously selected the NVIDIA NIM model serving type, the Models page displays NVIDIA model serving enabled on the upper-right corner, along with the Deploy model button. To proceed, click Deploy model.
The Deploy model dialog opens.
Configure properties for deploying your model as follows:
1. In the Model deployment name field, enter a unique name for the deployment.
2. From the NVIDIA NIM list, select the NVIDIA NIM model that you want to deploy. For more information, see Supported Models
3. In the NVIDIA NIM storage size field, specify the size of the cluster storage instance that will be created to store the NVIDIA NIM model.
4. In the Number of model server replicas to deploy field, specify a value.
5. From the Model server size list, select a value.
From the Hardware profile list, select a hardware profile.
Important
By default, hardware profiles are hidden in the dashboard navigation menu and user interface, while accelerator profiles remain visible. In addition, user interface components associated with the deprecated accelerator profiles functionality are still displayed. If you enable hardware profiles, the Hardware profiles list appears instead of the Accelerator profiles list. To show the Settings Hardware profiles option in the dashboard navigation menu, and the user interface components associated with hardware profiles, set the disableHardwareProfiles value to false in the OdhDashboardConfig custom resource (CR) in OpenShift. For more information about setting dashboard configuration options, see Customizing the dashboard.
Optional: Click Customize resource requests and limit and update the following values:
1. In the CPUs requests field, specify the number of CPUs to use with your model server. Use the list beside this field to specify the value in cores or millicores.
2. In the CPU limits field, specify the maximum number of CPUs to use with your model server. Use the list beside this field to specify the value in cores or millicores.
3. In the Memory requests field, specify the requested memory for the model server in gibibytes (Gi).
4. In the Memory limits field, specify the maximum memory limit for the model server in gibibytes (Gi).
Optional: In the Model route section, select the Make deployed models available through an external route checkbox to make your deployed models available to external clients.
To require token authentication for inference requests to the deployed model, perform the following actions:
1. Select Require token authentication.
2. In the Service account name field, enter the service account name that the token will be generated for.
3. To add an additional service account, click Add a service account and enter another service account name.
Click Deploy.

Verification

Confirm that the deployed model is shown on the Models tab for the project, and on the Model deployments page of the dashboard with a checkmark in the Status column.

2.17.3. Customizing model selection options for the NVIDIA NIM model serving platform
Copy link

The NVIDIA NIM model serving platform provides access to all available NVIDIA NIM models from the NVIDIA GPU Cloud (NGC). You can deploy a NIM model by selecting it from the NVIDIA NIM list in the Deploy model dialog. To customize the models that appear in the list, you can create a ConfigMap object specifying your preferred models.

Prerequisites

You have cluster administrator privileges for your OpenShift cluster.
You have an NVIDIA Cloud Account (NCA) and can access the NVIDIA GPU Cloud (NGC) portal.
You know the IDs of the NVIDIA NIM models that you want to make available for selection on the NVIDIA NIM model serving platform.
Note
- You can find the model ID from the NGC Catalog. The ID is usually part of the URL path.
- You can also find the model ID by using the NGC CLI. For more information, see NGC CLI reference.
You know the name and namespace of your Account custom resource (CR).

Procedure

In a terminal window, log in to the OpenShift CLI as a cluster administrator as shown in the following example:
```
oc login <openshift_cluster_url> -u <admin_username> -p <password>
```
```
oc login <openshift_cluster_url> -u <admin_username> -p <password>
```
Copy to Clipboard Toggle word wrap

Define a ConfigMap object in a YAML file, similar to the one in the following example, containing the model IDs that you want to make available for selection on the NVIDIA NIM model serving platform:

apiVersion: v1
kind: ConfigMap
metadata:
 name: nvidia-nim-enabled-models
data:
 models: |-
    [
    "mistral-nemo-12b-instruct",
    "llama3-70b-instruct",
    "phind-codellama-34b-v2-instruct",
    "deepseek-r1",
    "qwen-2.5-72b-instruct"
    ]

apiVersion: v1
kind: ConfigMap
metadata:
 name: nvidia-nim-enabled-models
data:
 models: |-
    [
    "mistral-nemo-12b-instruct",
    "llama3-70b-instruct",
    "phind-codellama-34b-v2-instruct",
    "deepseek-r1",
    "qwen-2.5-72b-instruct"
    ]

Copy to Clipboard

Toggle word wrap

Confirm the name and namespace of your Account CR:

oc get account -A

oc get account -A

Copy to Clipboard

Toggle word wrap

You see output similar to the following example:

NAMESPACE         NAME       TEMPLATE  CONFIGMAP  SECRET
redhat-ods-applications  odh-nim-account

NAMESPACE         NAME       TEMPLATE  CONFIGMAP  SECRET
redhat-ods-applications  odh-nim-account

Copy to Clipboard

Toggle word wrap

Deploy the ConfigMap object in the same namespace as your Account CR:
```
oc apply -f <configmap-name> -n <namespace>
```
```
oc apply -f <configmap-name> -n <namespace>
```
Copy to Clipboard Toggle word wrap
Replace <configmap-name> with the name of your YAML file, and <namespace> with the namespace of your Account CR.
Add the ConfigMap object that you previously created to the spec.modelListConfig section of your Account CR:
```
oc patch account <account-name> \
  --type='merge' \
  	-p '{"spec": {"modelListConfig": {"name": "<configmap-name>"}}}'
```
```
oc patch account <account-name> \
  --type='merge' \
  	-p '{"spec": {"modelListConfig": {"name": "<configmap-name>"}}}'
```
Copy to Clipboard Toggle word wrap
Replace <account-name> with the name of your Account CR, and <configmap-name> with your ConfigMap object.
Confirm that the ConfigMap object is added to your Account CR:
```
oc get account <account-name> -o yaml
```
```
oc get account <account-name> -o yaml
```
Copy to Clipboard Toggle word wrap
You see the ConfigMap object in the spec.modelListConfig section of your Account CR, similar to the following output:
```
spec:
 enabledModelsConfig:
 modelListConfig:
  name: <configmap-name>
```
```
spec:
 enabledModelsConfig:
 modelListConfig:
  name: <configmap-name>
```
Copy to Clipboard Toggle word wrap

Verification

Follow the steps to deploy a model as described in Deploying models on the NVIDIA NIM model serving platform to deploy a NIM model. You see that the NVIDIA NIM list in the Deploy model dialog displays your preferred list of models instead of all the models available in the NGC catalog.

2.17.4. Enabling NVIDIA NIM metrics for an existing NIM deployment
Copy link

If you have previously deployed a NIM model in OpenShift AI, and then upgraded to the latest version, you must manually enable NIM metrics for your existing deployment by adding annotations to enable metrics collection and graph generation.

Note

NIM metrics and graphs are automatically enabled for new deployments in the latest version of OpenShift AI.

2.17.4.1. Enabling graph generation for an existing NIM deployment
Copy link

The following procedure describes how to enable graph generation for an existing NIM deployment.

Prerequisites

You have cluster administrator privileges for your OpenShift cluster.
You have downloaded and installed the OpenShift command-line interface (CLI). For more information, see Installing the OpenShift CLI (Red Hat OpenShift Dedicated) or Installing the OpenShift CLI (Red Hat OpenShift Service on AWS).
You have an existing NIM deployment in OpenShift AI.

Procedure

In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI.
Confirm the name of the ServingRuntime associated with your NIM deployment:
```
oc get servingruntime -n <namespace>
```
```
oc get servingruntime -n <namespace>
```
Copy to Clipboard Toggle word wrap
Replace <namespace> with the namespace of the project where your NIM model is deployed.
Check for an existing metadata.annotations section in the ServingRuntime configuration:
```
oc get servingruntime -n  <namespace> <servingruntime-name> -o json | jq '.metadata.annotations'
```
```
oc get servingruntime -n  <namespace> <servingruntime-name> -o json | jq '.metadata.annotations'
```
Copy to Clipboard Toggle word wrap
Replace <servingruntime-name> with the name of the ServingRuntime from the previous step.

Perform one of the following actions:

If the metadata.annotations section is not present in the configuration, add the section with the required annotations:

oc patch servingruntime -n <namespace> <servingruntime-name> --type json --patch \
 '[{"op": "add", "path": "/metadata/annotations", "value": {"runtimes.opendatahub.io/nvidia-nim": "true"}}]'

oc patch servingruntime -n <namespace> <servingruntime-name> --type json --patch \
 '[{"op": "add", "path": "/metadata/annotations", "value": {"runtimes.opendatahub.io/nvidia-nim": "true"}}]'

Copy to Clipboard

Toggle word wrap

You see output similar to the following:

servingruntime.serving.kserve.io/nim-serving-runtime patched

servingruntime.serving.kserve.io/nim-serving-runtime patched

Copy to Clipboard

Toggle word wrap

If there is an existing metadata.annotations section, add the required annotations to the section:

oc patch servingruntime -n <project-namespace> <runtime-name> --type json --patch \
 '[{"op": "add", "path": "/metadata/annotations/runtimes.opendatahub.io~1nvidia-nim", "value": "true"}]'

oc patch servingruntime -n <project-namespace> <runtime-name> --type json --patch \
 '[{"op": "add", "path": "/metadata/annotations/runtimes.opendatahub.io~1nvidia-nim", "value": "true"}]'

Copy to Clipboard

Toggle word wrap

You see output similar to the following:

servingruntime.serving.kserve.io/nim-serving-runtime patched

servingruntime.serving.kserve.io/nim-serving-runtime patched

Copy to Clipboard

Toggle word wrap

Verification

Confirm that the annotation has been added to the ServingRuntime of your existing NIM deployment.
```
oc get servingruntime -n <namespace> <servingruntime-name> -o json | jq '.metadata.annotations'
```
```
oc get servingruntime -n <namespace> <servingruntime-name> -o json | jq '.metadata.annotations'
```
Copy to Clipboard Toggle word wrap
The annotation that you added appears in the output:
```
...
"runtimes.opendatahub.io/nvidia-nim": "true"
```
```
...
"runtimes.opendatahub.io/nvidia-nim": "true"
```
Copy to Clipboard Toggle word wrap
Note
For metrics to be available for graph generation, you must also enable metrics collection for your deployment. Please see Enabling metrics collection for an existing NIM deployment.

2.17.4.2. Enabling metrics collection for an existing NIM deployment
Copy link

To enable metrics collection for your existing NIM deployment, you must manually add the Prometheus endpoint and port annotations to the InferenceService of your deployment.

The following procedure describes how to add the required Prometheus annotations to the InferenceService of your NIM deployment.

Prerequisites

You have cluster administrator privileges for your OpenShift cluster.
You have downloaded and installed the OpenShift command-line interface (CLI). For more information, see Installing the OpenShift CLI (Red Hat OpenShift Dedicated) or Installing the OpenShift CLI (Red Hat OpenShift Service on AWS).
You have an existing NIM deployment in OpenShift AI.

Procedure

In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI.
Confirm the name of the InferenceService associated with your NIM deployment:
```
oc get inferenceservice -n <namespace>
```
```
oc get inferenceservice -n <namespace>
```
Copy to Clipboard Toggle word wrap
Replace <namespace> with the namespace of the project where your NIM model is deployed.
Check if there is an existing spec.predictor.annotations section in the InferenceService configuration:
```
oc get inferenceservice -n <namespace> <inferenceservice-name> -o json | jq '.spec.predictor.annotations'
```
```
oc get inferenceservice -n <namespace> <inferenceservice-name> -o json | jq '.spec.predictor.annotations'
```
Copy to Clipboard Toggle word wrap
Replace <inferenceservice-name> with the name of the InferenceService from the previous step.

Perform one of the following actions:

If the spec.predictor.annotations section does not exist in the configuration, add the section and required annotations:

oc patch inferenceservice -n <namespace> <inference-name> --type json --patch \
 '[{"op": "add", "path": "/spec/predictor/annotations", "value": {"prometheus.io/path": "/metrics", "prometheus.io/port": "8000"}}]'

oc patch inferenceservice -n <namespace> <inference-name> --type json --patch \
 '[{"op": "add", "path": "/spec/predictor/annotations", "value": {"prometheus.io/path": "/metrics", "prometheus.io/port": "8000"}}]'

Copy to Clipboard

Toggle word wrap

The annotation that you added appears in the output:

inferenceservice.serving.kserve.io/nim-serving-runtime patched

inferenceservice.serving.kserve.io/nim-serving-runtime patched

Copy to Clipboard

Toggle word wrap

If there is an existing spec.predictor.annotations section, add the Prometheus annotations to the section:

oc patch inferenceservice -n <namespace> <inference-service-name> --type json --patch \
 '[{"op": "add", "path": "/spec/predictor/annotations/prometheus.io~1path", "value": "/metrics"},
 {"op": "add", "path": "/spec/predictor/annotations/prometheus.io~1port", "value": "8000"}]'

oc patch inferenceservice -n <namespace> <inference-service-name> --type json --patch \
 '[{"op": "add", "path": "/spec/predictor/annotations/prometheus.io~1path", "value": "/metrics"},
 {"op": "add", "path": "/spec/predictor/annotations/prometheus.io~1port", "value": "8000"}]'

Copy to Clipboard

Toggle word wrap

The annotations that you added appears in the output:

inferenceservice.serving.kserve.io/nim-serving-runtime patched

inferenceservice.serving.kserve.io/nim-serving-runtime patched

Copy to Clipboard

Toggle word wrap

Verification

Confirm that the annotations have been added to the InferenceService.

oc get inferenceservice -n <namespace> <inferenceservice-name> -o json | jq '.spec.predictor.annotations'

oc get inferenceservice -n <namespace> <inferenceservice-name> -o json | jq '.spec.predictor.annotations'

Copy to Clipboard

Toggle word wrap

You see the annotation that you added in the output:

{
  "prometheus.io/path": "/metrics",
  "prometheus.io/port": "8000"
}

{
  "prometheus.io/path": "/metrics",
  "prometheus.io/port": "8000"
}

Copy to Clipboard

Toggle word wrap

2.17.5. Viewing NVIDIA NIM metrics for a NIM model
Copy link

In OpenShift AI, you can observe the following NVIDIA NIM metrics for a NIM model deployed on the NVIDIA NIM model serving platform:

GPU cache usage over time (ms)
Current running, waiting, and max requests count
Tokens count
Time to first token
Time per output token
Request outcomes

You can specify a time range and a refresh interval for these metrics to help you determine, for example, the peak usage hours and model performance at a specified time.

Prerequisites

You have enabled the NVIDIA NIM model serving platform.
You have deployed a NIM model on the NVIDIA NIM model serving platform.
The disableKServeMetrics OpenShift AI dashboard configuration option is set to its default value of false:
```
disableKServeMetrics: false
```
```
disableKServeMetrics: false
```
Copy to Clipboard Toggle word wrap
For more information about setting dashboard configuration options, see Customizing the dashboard.

Procedure

From the OpenShift AI dashboard navigation menu, click Data science projects.
The Data science projects page opens.
Click the name of the project that contains the NIM model that you want to monitor.
In the project details page, click the Models tab.
Click the NIM model that you want to observe.
On the NIM Metrics tab, set the following options:
- Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
- Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.
Scroll down to view data graphs for NIM metrics.

Verification

The NIM Metrics tab shows graphs of NIM metrics for the deployed NIM model.

Additional resources

NVIDIA NIM observability

2.17.6. Viewing performance metrics for a NIM model
Copy link

You can observe the following performance metrics for a NIM model deployed on the NVIDIA NIM model serving platform:

Number of requests - The number of requests that have failed or succeeded for a specific model.
Average response time (ms) - The average time it takes a specific model to respond to requests.
CPU utilization (%) - The percentage of the CPU limit per model replica that is currently utilized by a specific model.
Memory utilization (%) - The percentage of the memory limit per model replica that is utilized by a specific model.

You can specify a time range and a refresh interval for these metrics to help you determine, for example, the peak usage hours and model performance at a specified time.

Prerequisites

You have enabled the NVIDIA NIM model serving platform.
You have deployed a NIM model on the NVIDIA NIM model serving platform.
The disableKServeMetrics OpenShift AI dashboard configuration option is set to its default value of false:
```
disableKServeMetrics: false
```
```
disableKServeMetrics: false
```
Copy to Clipboard Toggle word wrap
For more information about setting dashboard configuration options, see Customizing the dashboard.

Procedure

From the OpenShift AI dashboard navigation menu, click Data science projects.
The Data science projects page opens.
Click the name of the project that contains the NIM model that you want to monitor.
In the project details page, click the Models tab.
Click the NIM model that you want to observe.
On the Endpoint performance tab, set the following options:
- Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
- Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed to show the latest data. You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.
Scroll down to view data graphs for performance metrics.

Verification

The Endpoint performance tab shows graphs of performance metrics for the deployed NIM model.

2.1. About the single-model serving platformCopy linkLink copied to clipboard!

2.2. ComponentsCopy linkLink copied to clipboard!

2.3. Installation optionsCopy linkLink copied to clipboard!

2.4. AuthorizationCopy linkLink copied to clipboard!

2.5. MonitoringCopy linkLink copied to clipboard!

2.6. Model-serving runtimesCopy linkLink copied to clipboard!

2.6.1. ServingRuntimeCopy linkLink copied to clipboard!

2.6.2. InferenceServiceCopy linkLink copied to clipboard!

2.7. Supported model-serving runtimesCopy linkLink copied to clipboard!

2.8. Tested and verified model-serving runtimesCopy linkLink copied to clipboard!

2.9. Inference endpointsCopy linkLink copied to clipboard!

2.9.1. Caikit TGIS ServingRuntime for KServeCopy linkLink copied to clipboard!

2.9.2. Caikit Standalone ServingRuntime for KServeCopy linkLink copied to clipboard!

2.9.3. TGIS Standalone ServingRuntime for KServeCopy linkLink copied to clipboard!

2.9.4. OpenVINO Model ServerCopy linkLink copied to clipboard!

2.9.5. vLLM NVIDIA GPU ServingRuntime for KServeCopy linkLink copied to clipboard!

2.9.6. vLLM Intel Gaudi Accelerator ServingRuntime for KServeCopy linkLink copied to clipboard!

2.9.7. vLLM AMD GPU ServingRuntime for KServeCopy linkLink copied to clipboard!

2.9.8. NVIDIA Triton Inference ServerCopy linkLink copied to clipboard!

2.9.9. Seldon MLServerCopy linkLink copied to clipboard!

2.10. About KServe deployment modesCopy linkLink copied to clipboard!

2.10.1. Advanced modeCopy linkLink copied to clipboard!

2.10.2. Standard modeCopy linkLink copied to clipboard!

2.11. Deploying models by using the single-model serving platformCopy linkLink copied to clipboard!

2.11.1. Enabling the single-model serving platformCopy linkLink copied to clipboard!

2.11.2. Adding a custom model-serving runtime for the single-model serving platformCopy linkLink copied to clipboard!

2.11.3. Adding a tested and verified model-serving runtime for the single-model serving platformCopy linkLink copied to clipboard!

2.11.4. Deploying models on the single-model serving platformCopy linkLink copied to clipboard!

2.11.5. Stopping and starting a deployed modelCopy linkLink copied to clipboard!

2.11.6. Deploying models by using multiple GPU nodesCopy linkLink copied to clipboard!

2.11.7. Setting a timeout for KServeCopy linkLink copied to clipboard!

2.11.8. Customizing the parameters of a deployed model-serving runtimeCopy linkLink copied to clipboard!

2.11.9. Customizable model serving runtime parametersCopy linkLink copied to clipboard!

2.11.10. Using OCI containers for model storageCopy linkLink copied to clipboard!

2.11.10.1. Storing a model in an OCI imageCopy linkLink copied to clipboard!

2.11.10.2. Deploying a model stored in an OCI image by using the CLICopy linkLink copied to clipboard!

2.11.11. Using accelerators with vLLMCopy linkLink copied to clipboard!

2.11.11.1. NVIDIA GPUsCopy linkLink copied to clipboard!

2.11.11.2. Intel Gaudi acceleratorsCopy linkLink copied to clipboard!

2.11.11.3. AMD GPUsCopy linkLink copied to clipboard!

2.11.12. Customizing the vLLM model-serving runtimeCopy linkLink copied to clipboard!

2.12. Making inference requests to models deployed on the single-model serving platformCopy linkLink copied to clipboard!

2.12.1. Accessing the authentication token for a deployed modelCopy linkLink copied to clipboard!

2.12.2. Accessing the inference endpoint for a deployed modelCopy linkLink copied to clipboard!

2.13. Viewing model-serving runtime metrics for the single-model serving platformCopy linkLink copied to clipboard!

2.14. Monitoring model performanceCopy linkLink copied to clipboard!

2.14.1. Viewing performance metrics for a deployed modelCopy linkLink copied to clipboard!

2.14.2. Deploying a Grafana metrics dashboardCopy linkLink copied to clipboard!

2.14.3. Deploying a vLLM/GPU metrics dashboard on a Grafana instanceCopy linkLink copied to clipboard!

2.14.4. Grafana metricsCopy linkLink copied to clipboard!

2.14.4.1. Accelerator metricsCopy linkLink copied to clipboard!

2.14.4.2. CPU metricsCopy linkLink copied to clipboard!

2.14.4.3. vLLM metricsCopy linkLink copied to clipboard!

2.14.5. Configuring metrics-based autoscalingCopy linkLink copied to clipboard!

2.15. Optimizing model-serving runtimesCopy linkLink copied to clipboard!

2.15.1. Enabling speculative decoding and multi-modal inferencingCopy linkLink copied to clipboard!

2.16. Performance optimization and tuningCopy linkLink copied to clipboard!

2.16.1. Determining GPU requirements for LLM-powered applicationsCopy linkLink copied to clipboard!

2.16.2. Performance considerations for text-summarization and retrieval-augmented generation (RAG) applicationsCopy linkLink copied to clipboard!

2.16.3. Inference performance metricsCopy linkLink copied to clipboard!

2.16.3.1. LatencyCopy linkLink copied to clipboard!

2.16.3.2. ThroughputCopy linkLink copied to clipboard!

2.16.3.3. Cost per million tokensCopy linkLink copied to clipboard!

2.16.4. Resolving CUDA out-of-memory errorsCopy linkLink copied to clipboard!

2.17. About the NVIDIA NIM model serving platformCopy linkLink copied to clipboard!

2.17.1. Enabling the NVIDIA NIM model serving platformCopy linkLink copied to clipboard!

2.17.2. Deploying models on the NVIDIA NIM model serving platformCopy linkLink copied to clipboard!

2.17.3. Customizing model selection options for the NVIDIA NIM model serving platformCopy linkLink copied to clipboard!

2.17.4. Enabling NVIDIA NIM metrics for an existing NIM deploymentCopy linkLink copied to clipboard!

2.17.4.1. Enabling graph generation for an existing NIM deploymentCopy linkLink copied to clipboard!

2.17.4.2. Enabling metrics collection for an existing NIM deploymentCopy linkLink copied to clipboard!

2.17.5. Viewing NVIDIA NIM metrics for a NIM modelCopy linkLink copied to clipboard!

2.17.6. Viewing performance metrics for a NIM modelCopy linkLink copied to clipboard!

Learn

Try, buy, & sell

Communities

About Red Hat Documentation

Making open source more inclusive

About Red Hat

2.1. About the single-model serving platform
Copy link

2.2. Components
Copy link

2.3. Installation options
Copy link

2.4. Authorization
Copy link

2.5. Monitoring
Copy link

2.6. Model-serving runtimes
Copy link

2.6.1. ServingRuntime
Copy link

2.6.2. InferenceService
Copy link

2.7. Supported model-serving runtimes
Copy link

2.8. Tested and verified model-serving runtimes
Copy link

2.9. Inference endpoints
Copy link

2.9.1. Caikit TGIS ServingRuntime for KServe
Copy link

2.9.2. Caikit Standalone ServingRuntime for KServe
Copy link

2.9.3. TGIS Standalone ServingRuntime for KServe
Copy link

2.9.4. OpenVINO Model Server
Copy link

2.9.5. vLLM NVIDIA GPU ServingRuntime for KServe
Copy link

2.9.6. vLLM Intel Gaudi Accelerator ServingRuntime for KServe
Copy link

2.9.7. vLLM AMD GPU ServingRuntime for KServe
Copy link

2.9.8. NVIDIA Triton Inference Server
Copy link

2.9.9. Seldon MLServer
Copy link

2.10. About KServe deployment modes
Copy link

2.10.1. Advanced mode
Copy link

2.10.2. Standard mode
Copy link

2.11. Deploying models by using the single-model serving platform
Copy link

2.11.1. Enabling the single-model serving platform
Copy link

2.11.2. Adding a custom model-serving runtime for the single-model serving platform
Copy link

2.11.3. Adding a tested and verified model-serving runtime for the single-model serving platform
Copy link

2.11.4. Deploying models on the single-model serving platform
Copy link

2.11.5. Stopping and starting a deployed model
Copy link

2.11.6. Deploying models by using multiple GPU nodes
Copy link

2.11.7. Setting a timeout for KServe
Copy link

2.11.8. Customizing the parameters of a deployed model-serving runtime
Copy link

2.11.9. Customizable model serving runtime parameters
Copy link

2.11.10. Using OCI containers for model storage
Copy link

2.11.10.1. Storing a model in an OCI image
Copy link

2.11.10.2. Deploying a model stored in an OCI image by using the CLI
Copy link

2.11.11. Using accelerators with vLLM
Copy link

2.11.11.1. NVIDIA GPUs
Copy link

2.11.11.2. Intel Gaudi accelerators
Copy link

2.11.11.3. AMD GPUs
Copy link

2.11.12. Customizing the vLLM model-serving runtime
Copy link

2.12. Making inference requests to models deployed on the single-model serving platform
Copy link

2.12.1. Accessing the authentication token for a deployed model
Copy link

2.12.2. Accessing the inference endpoint for a deployed model
Copy link

2.13. Viewing model-serving runtime metrics for the single-model serving platform
Copy link

2.14. Monitoring model performance
Copy link

2.14.1. Viewing performance metrics for a deployed model
Copy link

2.14.2. Deploying a Grafana metrics dashboard
Copy link

2.14.3. Deploying a vLLM/GPU metrics dashboard on a Grafana instance
Copy link

2.14.4. Grafana metrics
Copy link

2.14.4.1. Accelerator metrics
Copy link

2.14.4.2. CPU metrics
Copy link

2.14.4.3. vLLM metrics
Copy link

2.14.5. Configuring metrics-based autoscaling
Copy link

2.15. Optimizing model-serving runtimes
Copy link

2.15.1. Enabling speculative decoding and multi-modal inferencing
Copy link

2.16. Performance optimization and tuning
Copy link

2.16.1. Determining GPU requirements for LLM-powered applications
Copy link

2.16.2. Performance considerations for text-summarization and retrieval-augmented generation (RAG) applications
Copy link

2.16.3. Inference performance metrics
Copy link

2.16.3.1. Latency
Copy link

2.16.3.2. Throughput
Copy link

2.16.3.3. Cost per million tokens
Copy link

2.16.4. Resolving CUDA out-of-memory errors
Copy link

2.17. About the NVIDIA NIM model serving platform
Copy link

2.17.1. Enabling the NVIDIA NIM model serving platform
Copy link

2.17.2. Deploying models on the NVIDIA NIM model serving platform
Copy link

2.17.3. Customizing model selection options for the NVIDIA NIM model serving platform
Copy link

2.17.4. Enabling NVIDIA NIM metrics for an existing NIM deployment
Copy link

2.17.4.1. Enabling graph generation for an existing NIM deployment
Copy link

2.17.4.2. Enabling metrics collection for an existing NIM deployment
Copy link

2.17.5. Viewing NVIDIA NIM metrics for a NIM model
Copy link

2.17.6. Viewing performance metrics for a NIM model
Copy link