Chapter 3. Serving large models

3.1. About the single-model serving platform

For deploying large models such as large language models (LLMs), OpenShift AI includes a single-model serving platform that is based on the KServe component. Because each model is deployed on its own model server, the single-model serving platform helps you to deploy, monitor, scale, and maintain large models that require increased resources.

3.2. Components

KServe: A Kubernetes custom resource definition (CRD) that orchestrates model serving for all types of models. KServe includes model-serving runtimes that implement the loading of given types of model servers. KServe also handles the lifecycle of the deployment object, storage access, and networking setup.
Red Hat OpenShift Serverless: A cloud-native development model that allows for serverless deployments of models. OpenShift Serverless is based on the open source Knative project.
Red Hat OpenShift Service Mesh: A service mesh networking layer that manages traffic flows and enforces access policies. OpenShift Service Mesh is based on the open source Istio project.

3.3. Installation options

To install the single-model serving platform, you have the following options:

Automated installation

If you have not already created a ServiceMeshControlPlane or KNativeServing resource on your OpenShift cluster, you can configure the Red Hat OpenShift AI Operator to install KServe and configure its dependencies.

For more information about automated installation, see Configuring automated installation of KServe.

Manual installation

If you have already created a ServiceMeshControlPlane or KNativeServing resource on your OpenShift cluster, you cannot configure the Red Hat OpenShift AI Operator to install KServe and configure its dependencies. In this situation, you must install KServe manually.

For more information about manual installation, see Manually installing KServe.

3.4. Authorization

You can add Authorino as an authorization provider for the single-model serving platform. Adding an authorization provider allows you to enable token authorization for models that you deploy on the platform, which ensures that only authorized parties can make inference requests to the models.

To add Authorino as an authorization provider on the single-model serving platform, you have the following options:

If automated installation of the single-model serving platform is possible on your cluster, you can include Authorino as part of the automated installation process.
If you need to manually install the single-model serving platform, you must also manually configure Authorino.

For guidance on choosing an installation option for the single-model serving platform, see Installation options.

3.5. Monitoring

You can configure monitoring for the single-model serving platform and use Prometheus to scrape metrics for each of the pre-installed model-serving runtimes.

3.6. Model-serving runtimes

You can serve models on the single-model serving platform by using model-serving runtimes. The configuration of a model-serving runtime is defined by the ServingRuntime and InferenceService custom resource definitions (CRDs).

3.6.1. ServingRuntime

The ServingRuntime CRD creates a serving runtime, an environment for deploying and managing a model. It creates the templates for pods that dynamically load and unload models of various formats and also exposes a service endpoint for inferencing requests.

The following YAML configuration is an example of the vLLM ServingRuntime for KServe model-serving runtime. The configuration includes various flags, environment variables and command-line arguments.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]' 1
    openshift.io/display-name: vLLM ServingRuntime for KServe 2
  labels:
    opendatahub.io/dashboard: "true"
  name: vllm-runtime
spec:
     annotations:
          prometheus.io/path: /metrics 3
          prometheus.io/port: "8080" 4
     containers :
          - args:
               - --port=8080
               - --model=/mnt/models 5
               - --served-model-name={{.Name}} 6
             command: 7
                  - python
                  - '-m'
                  - vllm.entrypoints.openai.api_server
             env:
                  - name: HF_HOME
                     value: /tmp/hf_home
             image: 8
quay.io/modh/vllm@sha256:8a3dd8ad6e15fe7b8e5e471037519719d4d8ad3db9d69389f2beded36a6f5b21
          name: kserve-container
          ports:
               - containerPort: 8080
                   protocol: TCP
    multiModel: false 9
    supportedModelFormats: 10
        - autoSelect: true
           name: vLLM

1: The recommended accelerator to use with the runtime.
2: The name with which the serving runtime is displayed.
3: The endpoint used by Prometheus to scrape metrics for monitoring.
4: The port used by Prometheus to scrape metrics for monitoring.
5: The path to where the model files are stored in the runtime container.
6: Passes the model name that is specified by the {{.Name}} template variable inside the runtime container specification to the runtime environment. The {{.Name}} variable maps to the spec.predictor.name field in the InferenceService metadata object.
7: The entrypoint command that starts the runtime container.
8: The runtime container image used by the serving runtime. This image differs depending on the type of accelerator used.
9: Specifies that the runtime is used for single-model serving.
10: Specifies the model formats supported by the runtime.

3.6.2. InferenceService

The InferenceService CRD creates a server or inference service that processes inference queries, passes it to the model, and then returns the inference output.

The inference service also performs the following actions:

Specifies the location and format of the model.
Specifies the serving runtime used to serve the model.
Enables the passthrough route for gRPC or REST inference.
Defines HTTP or gRPC endpoints for the deployed model.

The following example shows the InferenceService YAML configuration file that is generated when deploying a granite model with the vLLM runtime:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: granite
    serving.knative.openshift.io/enablePassthrough: 'true'
    sidecar.istio.io/inject: 'true'
    sidecar.istio.io/rewriteAppHTTPProbers: 'true'
  name: granite
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '6'
          memory: 24Gi
          nvidia.com/gpu: '1'
        requests:
          cpu: '1'
          memory: 8Gi
          nvidia.com/gpu: '1'
      runtime: vLLM ServingRuntime for KServe
      storage:
        key: aws-connection-my-storage
        path: models/granite-7b-instruct/
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists

Additional resources

Serving Runtimes

3.7. Supported model-serving runtimes

OpenShift AI includes several preinstalled model-serving runtimes. You can use preinstalled model-serving runtimes to start serving models without modifying or defining the runtime yourself. You can also add a custom runtime to support a model.

For help adding a custom runtime, see Adding a custom model-serving runtime for the single-model serving platform.

Table 3.1. Model-serving runtimes
Name	Description	Exported model format
Caikit Text Generation Inference Server (Caikit-TGIS) ServingRuntime for KServe (1)	A composite runtime for serving models in the Caikit format	Caikit Text Generation
Caikit Standalone ServingRuntime for KServe (2)	A runtime for serving models in the Caikit embeddings format for embeddings tasks	Caikit Embeddings
OpenVINO Model Server	A scalable, high-performance runtime for serving models that are optimized for Intel architectures	PyTorch, TensorFlow, OpenVINO IR, PaddlePaddle, MXNet, Caffe, Kaldi
Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe (3)	A runtime for serving TGI-enabled models	PyTorch Model Formats
vLLM ServingRuntime for KServe	A high-throughput and memory-efficient inference and serving runtime for large language models	Supported models
vLLM ServingRuntime with Gaudi accelerators support for KServe	A high-throughput and memory-efficient inference and serving runtime that supports Intel Gaudi accelerators	Supported models
vLLM ROCm ServingRuntime for KServe	A high-throughput and memory-efficient inference and serving runtime that supports AMD GPU accelerators	Supported models

The composite Caikit-TGIS runtime is based on Caikit and Text Generation Inference Server (TGIS). To use this runtime, you must convert your models to Caikit format. For an example, see Converting Hugging Face Hub models to Caikit format in the caikit-tgis-serving repository.
The Caikit Standalone runtime is based on Caikit NLP. To use this runtime, you must convert your models to the Caikit embeddings format. For an example, see Tests for text embedding module.
Text Generation Inference Server (TGIS) is based on an early fork of Hugging Face TGI. Red Hat will continue to develop the standalone TGIS runtime to support TGI models. If a model is incompatible in the current version of OpenShift AI, support might be added in a future version. In the meantime, you can also add your own custom runtime to support a TGI model. For more information, see Adding a custom model-serving runtime for the single-model serving platform.

Table 3.2. Deployment requirements
Name	Default protocol	Additonal protocol	Model mesh support	Single node OpenShift support	Deployment mode
Caikit Text Generation Inference Server (Caikit-TGIS) ServingRuntime for KServe	REST	gRPC	No	Yes	Raw and serverless
Caikit Standalone ServingRuntime for KServe	REST	gRPC	No	Yes	Raw and serverless
OpenVINO Model Server	REST	None	Yes	Yes	Raw and serverless
Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe	gRPC	None	No	Yes	Raw and serverless
vLLM ServingRuntime for KServe	REST	None	No	Yes	Raw and serverless
vLLM ServingRuntime with Gaudi accelerators support for KServe	REST	None	No	Yes	Raw and serverless
vLLM ROCm ServingRuntime for KServe	REST	None	No	Yes	Raw and serverless

Additional resources

Inference endpoints

3.8. Tested and verified model-serving runtimes

Tested and verified runtimes are community versions of model-serving runtimes that have been tested and verified against specific versions of OpenShift AI.

Red Hat tests the current version of a tested and verified runtime each time there is a new version of OpenShift AI. If a new version of a tested and verified runtime is released in the middle of an OpenShift AI release cycle, it will be tested and verified in an upcoming release.

A list of the tested and verified runtimes and compatible versions is available in the OpenShift AI release notes.

Note

Tested and verified runtimes are not directly supported by Red Hat. You are responsible for ensuring that you are licensed to use any tested and verified runtimes that you add, and for correctly configuring and maintaining them.

For more information, see Tested and verified runtimes in OpenShift AI.

Table 3.3. Model-serving runtimes
Name	Description	Exported model format
NVIDIA Triton Inference Server	An open-source inference-serving software for fast and scalable AI in applications.	TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more

Table 3.4. Deployment requirements
Name	Default protocol	Additonal protocol	Model mesh support	Single node OpenShift support	Deployment mode
NVIDIA Triton Inference Server	gRPC	REST	Yes	Yes	Raw and serverless

Additional resources

Inference endpoints

3.9. Inference endpoints

These examples show how to use inference endpoints to query the model.

Note

If you enabled token authorization when deploying the model, add the Authorization header and specify a token value.

3.9.1. Caikit TGIS ServingRuntime for KServe

:443/api/v1/task/text-generation
:443/api/v1/task/server-streaming-text-generation

Example command

curl --json '{"model_id": "<model_name__>", "inputs": "<text>"}' https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation -H 'Authorization: Bearer <token>'

3.9.2. Caikit Standalone ServingRuntime for KServe

If you are serving multiple models, you can query /info/models or :443 caikit.runtime.info.InfoService/GetModelsInfo to view a list of served models.

REST endpoints

/api/v1/task/embedding
/api/v1/task/embedding-tasks
/api/v1/task/sentence-similarity
/api/v1/task/sentence-similarity-tasks
/api/v1/task/rerank
/api/v1/task/rerank-tasks
/info/models
/info/version
/info/runtime

gRPC endpoints

:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict
:443 caikit.runtime.Nlp.NlpService/EmbeddingTasksPredict
:443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTaskPredict
:443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTasksPredict
:443 caikit.runtime.Nlp.NlpService/RerankTaskPredict
:443 caikit.runtime.Nlp.NlpService/RerankTasksPredict
:443 caikit.runtime.info.InfoService/GetModelsInfo
:443 caikit.runtime.info.InfoService/GetRuntimeInfo

Note

By default, the Caikit Standalone Runtime exposes REST endpoints. To use gRPC protocol, manually deploy a custom Caikit Standalone ServingRuntime. For more information, see Adding a custom model-serving runtime for the single-model serving platform.

An example manifest is available in the caikit-tgis-serving GitHub repository.

REST

curl -H 'Content-Type: application/json' -d '{"inputs": "<text>", "model_id": "<model_id>"}' <inference_endpoint_url>/api/v1/task/embedding -H 'Authorization: Bearer <token>'

gRPC

grpcurl -d '{"text": "<text>"}' -H \"mm-model-id: <model_id>\" <inference_endpoint_url>:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict -H 'Authorization: Bearer <token>'

3.9.3. TGIS Standalone ServingRuntime for KServe

:443 fmaas.GenerationService/Generate
:443 fmaas.GenerationService/GenerateStream
Note
To query the endpoint for the TGIS standalone runtime, you must also download the files in the proto directory of the OpenShift AI text-generation-inference repository.

Example command

grpcurl -proto text-generation-inference/proto/generation.proto -d '{"requests": [{"text":"<text>"}]}' -H 'Authorization: Bearer <token>' -insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate

3.9.4. OpenVINO Model Server

/v2/models/<model-name>/infer

Example command

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

3.9.5. vLLM ServingRuntime for KServe

:443/version
:443/docs
:443/v1/models
:443/v1/chat/completions
:443/v1/completions
:443/v1/embeddings
:443/tokenize
:443/detokenize
Note
- The vLLM runtime is compatible with the OpenAI REST API. For a list of models that the vLLM runtime supports, see Supported models.
- To use the embeddings inference endpoint in vLLM, you must use an embeddings model that the vLLM supports. You cannot use the embeddings endpoint with generative models. For more information, see Supported embeddings models in vLLM.
- As of vLLM v0.5.5, you must provide a chat template while querying a model using the /v1/chat/completions endpoint. If your model does not include a predefined chat template, you can use the chat-template command-line parameter to specify a chat template in your custom vLLM runtime, as shown in the example. Replace <CHAT_TEMPLATE> with the path to your template.
  containers: - args: - --chat-template=<CHAT_TEMPLATE>
  You can use the chat templates that are available as .jinja files here or with the vLLM image under /apps/data/template. For more information, see Chat templates.
As indicated by the paths shown, the single-model serving platform uses the HTTPS port of your OpenShift router (usually port 443) to serve external API requests.

Example command

curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -d '{ "messages": [{ "role": "<role>", "content": "<content>" }] -H 'Authorization: Bearer <token>'

3.9.6. vLLM ServingRuntime with Gaudi accelerators support for KServe

See vLLM ServingRuntime for KServe.

3.9.7. vLLM ROCm ServingRuntime for KServe

See vLLM ServingRuntime for KServe.

3.9.8. NVIDIA Triton Inference Server

REST endpoints

v2/models/[/versions/<model_version>]/infer
v2/models/<model_name>[/versions/<model_version>]
v2/health/ready
v2/health/live
v2/models/<model_name>[/versions/]/ready
v2

Note

ModelMesh does not support the following REST endpoints:

v2/health/live
v2/health/ready
v2/models/<model_name>[/versions/]/ready

Example command

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

gRPC endpoints

:443 inference.GRPCInferenceService/ModelInfer
:443 inference.GRPCInferenceService/ModelReady
:443 inference.GRPCInferenceService/ModelMetadata
:443 inference.GRPCInferenceService/ServerReady
:443 inference.GRPCInferenceService/ServerLive
:443 inference.GRPCInferenceService/ServerMetadata

Example command

grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata

3.9.9. Additional resources

3.10. About KServe deployment modes

By default, you can deploy models on the single-model serving platform with KServe by using Red Hat OpenShift Serverless, which is a cloud-native development model that allows for serverless deployments of models. OpenShift Serverless is based on the open source Knative project. In addition, serverless mode is dependent on the Red Hat OpenShift Serverless Operator.

Alternatively, you can use raw deployment mode, which is not dependent on the Red Hat OpenShift Serverless Operator. With raw deployment mode, you can deploy models with Kubernetes resources, such as Deployment, Service, Ingress, and Horizontal Pod Autoscaler.

Important

Deploying a machine learning model using KServe raw deployment mode is a Limited Availability feature. Limited Availability means that you can install and receive support for the feature only with specific approval from the Red Hat AI Business Unit. Without such approval, the feature is unsupported. In addition, this feature is only supported on Self-Managed deployments of single node OpenShift.

There are both advantages and disadvantages to using each of these deployment modes:

3.10.1. Serverless mode

Advantages:

Enables autoscaling based on request volume:
- Resources scale up automatically when receiving incoming requests.
- Optimizes resource usage and maintains performance during peak times.
Supports scale down to and from zero using Knative:
- Allows resources to scale down completely when there are no incoming requests.
- Saves costs by not running idle resources.

Disadvantages:

Has customization limitations:
- Serverless is limited to Knative, such as when mounting multiple volumes.
Dependency on Knative for scaling:
- Introduces additional complexity in setup and management compared to traditional scaling methods.

3.10.2. Raw deployment mode

Advantages:

Enables deployment with Kubernetes resources, such as Deployment, Service, Ingress, and Horizontal Pod Autoscaler:
- Provides full control over Kubernetes resources, allowing for detailed customization and configuration of deployment settings.
Unlocks Knative limitations, such as being unable to mount multiple volumes:
- Beneficial for applications requiring complex configurations or multiple storage mounts.

Disadvantages:

Does not support automatic scaling:
- Does not support automatic scaling down to zero resources when idle.
- Might result in higher costs during periods of low traffic.
Requires manual management of scaling.

3.11. Deploying models on single node OpenShift using KServe raw deployment mode

You can deploy a machine learning model by using KServe raw deployment mode on single node OpenShift. Raw deployment mode offers several advantages over Knative, such as the ability to mount multiple volumes.

Important

Deploying a machine learning model using KServe raw deployment mode on single node OpenShift is a Limited Availability feature. Limited Availability means that you can install and receive support for the feature only with specific approval from the Red Hat AI Business Unit. Without such approval, the feature is unsupported.

Prerequisites

You have logged in to Red Hat OpenShift AI.
You have cluster administrator privileges for your OpenShift cluster.
You have created an OpenShift cluster that has a node with at least 4 CPUs and 16 GB memory.
You have installed the Red Hat OpenShift AI (RHOAI) Operator.
You have installed the OpenShift command-line interface (CLI). For more information about installing the OpenShift command-line interface (CLI), see Getting started with the OpenShift CLI.
You have installed KServe.
You have access to S3-compatible object storage.
For the model that you want to deploy, you know the associated folder path in your S3-compatible object storage bucket.
To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see Converting Hugging Face Hub models to Caikit format in the caikit-tgis-serving repository.
If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in OpenShift AI. If you use NVIDIA GPUs, see Enabling NVIDIA GPUs. If you use AMD GPUs, see AMD GPU integration.
To use the vLLM runtime, you have enabled GPU support in OpenShift AI and have installed and configured the Node Feature Discovery operator on your cluster. For more information, see Installing the Node Feature Discovery operator and Enabling NVIDIA GPUs.

Procedure

Open a command-line terminal and log in to your OpenShift cluster as cluster administrator:
```
$ oc login <openshift_cluster_url> -u <admin_username> -p <password>
```
By default, OpenShift uses a service mesh for network traffic management. Because KServe raw deployment mode does not require a service mesh, disable Red Hat OpenShift Service Mesh:
1. Enter the following command to disable Red Hat OpenShift Service Mesh:
```
$ oc edit dsci -n redhat-ods-operator
```
2. In the YAML editor, change the value of managementState for the serviceMesh component to Removed as shown:
```
spec:
  components:
    serviceMesh:
      managementState: Removed
```
3. Save the changes.

Create a project:

$ oc new-project <project_name> --description="<description>" --display-name="<display_name>"

For information about creating projects, see Working with projects.

Create a data science cluster:
1. In the Red Hat OpenShift web console Administrator view, click Operators Installed Operators and then click the Red Hat OpenShift AI Operator.
2. Click the Data Science Cluster tab.
3. Click the Create DataScienceCluster button.
4. In the Configure via field, click the YAML view radio button.
5. In the spec.components section of the YAML editor, configure the kserve component as shown:
```
  kserve:
    defaultDeploymentMode: RawDeployment
    managementState: Managed
    serving:
      managementState: Removed
      name: knative-serving
```
6. Click Create.

Create a secret file:

At your command-line terminal, create a YAML file to contain your secret and add the following YAML code:

apiVersion: v1
kind: Secret
metadata:
  annotations:
    serving.kserve.io/s3-endpoint: <AWS_ENDPOINT>
    serving.kserve.io/s3-usehttps: "1"
    serving.kserve.io/s3-region: <AWS_REGION>
    serving.kserve.io/s3-useanoncredential: "false"
  name: <Secret-name>
stringData:
  AWS_ACCESS_KEY_ID: "<AWS_ACCESS_KEY_ID>"
  AWS_SECRET_ACCESS_KEY: "<AWS_SECRET_ACCESS_KEY>"

Important

If you are deploying a machine learning model in a disconnected deployment, add serving.kserve.io/s3-verifyssl: '0' to the metadata.annotations section.

Save the file with the file name secret.yaml.

Apply the secret.yaml file:

$ oc apply -f secret.yaml -n <namespace>

Create a service account:
1. Create a YAML file to contain your service account and add the following YAML code:
```
apiVersion: v1
kind: ServiceAccount
metadata:
  name: models-bucket-sa
secrets:
- name: s3creds
```
  For information about service accounts, see Understanding and creating service accounts.
2. Save the file with the file name serviceAccount.yaml.
3. Apply the serviceAccount.yaml file:
```
$ oc apply -f serviceAccount.yaml -n <namespace>
```

Create a YAML file for the serving runtime to define the container image that will serve your model predictions. Here is an example using the OpenVino Model Server:

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: ovms-runtime
spec:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8888"
  containers:
    - args:
        - --model_name={{.Name}}
        - --port=8001
        - --rest_port=8888
        - --model_path=/mnt/models
        - --file_system_poll_wait_seconds=0
        - --grpc_bind_address=0.0.0.0
        - --rest_bind_address=0.0.0.0
        - --target_device=AUTO
        - --metrics_enable
      image: quay.io/modh/openvino_model_server@sha256:6c7795279f9075bebfcd9aecbb4a4ce4177eec41fb3f3e1f1079ce6309b7ae45
      name: kserve-container
      ports:
        - containerPort: 8888
          protocol: TCP
  multiModel: false
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: openvino_ir
      version: opset13
    - name: onnx
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "2"
    - autoSelect: true
      name: paddle
      version: "2"
    - autoSelect: true
      name: pytorch
      version: "2"

If you are using the OpenVINO Model Server example above, ensure that you insert the correct values required for any placeholders in the YAML code.
Save the file with an appropriate file name.

Apply the file containing your serving run time:

$ oc apply -f <serving run time file name> -n <namespace>

Create an InferenceService custom resource (CR). Create a YAML file to contain the InferenceService CR. Using the OpenVINO Model Server example used previously, here is the corresponding YAML code:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.knative.openshift.io/enablePassthrough: "true"
    sidecar.istio.io/inject: "true"
    sidecar.istio.io/rewriteAppHTTPProbers: "true"
    serving.kserve.io/deploymentMode: RawDeployment
  name: <InferenceService-Name>
spec:
  predictor:
    scaleMetric:
    minReplicas: 1
    scaleTarget:
    canaryTrafficPercent:
    serviceAccountName: <serviceAccountName>
    model:
      env: []
      volumeMounts: []
      modelFormat:
        name: onnx
      runtime: ovms-runtime
      storageUri: s3://<bucket_name>/<model_directory_path>
      resources:
        requests:
          memory: 5Gi
    volumes: []

In your YAML code, ensure the following values are set correctly:
- serving.kserve.io/deploymentMode must contain the value RawDeployment.
- modelFormat must contain the value for your model format, such as onnx.
- storageUri must contain the value for your model s3 storage directory, for example s3://<bucket_name>/<model_directory_path>.
- runtime must contain the value for the name of your serving runtime, for example, ovms-runtime.
Save the file with an appropriate file name.

Apply the file containing your InferenceService CR:

$ oc apply -f <InferenceService CR file name> -n <namespace>

Verify that all pods are running in your cluster:

$ oc get pods -n <namespace>

Example output:

NAME READY STATUS RESTARTS AGE
<isvc_name>-predictor-xxxxx-2mr5l 1/1 Running 2 165m
console-698d866b78-m87pm 1/1 Running 2 165m

After you verify that all pods are running, forward the service port to your local machine:
```
$ oc -n <namespace> port-forward pod/<pod-name> <local_port>:<remote_port>
```
Ensure that you replace <namespace>, <pod-name>, <local_port>, <remote_port> (this is the model server port, for example, 8888) with values appropriate to your deployment.

Verification

Use your preferred client library or tool to send requests to the localhost inference URL.

3.12. Deploying models by using the single-model serving platform

On the single-model serving platform, each model is deployed on its own model server. This helps you to deploy, monitor, scale, and maintain large models that require increased resources.

Important

If you want to use the single-model serving platform to deploy a model from S3-compatible storage that uses a self-signed SSL certificate, you must install a certificate authority (CA) bundle on your OpenShift cluster. For more information, see Working with certificates (OpenShift AI Self-Managed) or Working with certificates (OpenShift AI Self-Managed in a disconnected environment).

3.12.1. Enabling the single-model serving platform

When you have installed KServe, you can use the Red Hat OpenShift AI dashboard to enable the single-model serving platform. You can also use the dashboard to enable model-serving runtimes for the platform.

Prerequisites

You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
You have installed KServe.
Your cluster administrator has not edited the OpenShift AI dashboard configuration to disable the ability to select the single-model serving platform, which uses the KServe component. For more information, see Dashboard configuration options.

Procedure

Enable the single-model serving platform as follows:
1. In the left menu, click Settings Cluster settings.
2. Locate the Model serving platforms section.
3. To enable the single-model serving platform for projects, select the Single-model serving platform checkbox.
4. Click Save changes.
Enable preinstalled runtimes for the single-model serving platform as follows:
1. In the left menu of the OpenShift AI dashboard, click Settings Serving runtimes.
  The Serving runtimes page shows preinstalled runtimes and any custom runtimes that you have added.
  For more information about preinstalled runtimes, see Supported runtimes.
2. Set the runtime that you want to use to Enabled.
  The single-model serving platform is now available for model deployments.

3.12.2. Adding a custom model-serving runtime for the single-model serving platform

A model-serving runtime adds support for a specified set of model frameworks and the model formats supported by those frameworks. You can use the pre-installed runtimes that are included with OpenShift AI. You can also add your own custom runtimes if the default runtimes do not meet your needs. For example, if the TGIS runtime does not support a model format that is supported by Hugging Face Text Generation Inference (TGI), you can create a custom runtime to add support for the model.

As an administrator, you can use the OpenShift AI interface to add and enable a custom model-serving runtime. You can then choose the custom runtime when you deploy a model on the single-model serving platform.

Note

Red Hat does not provide support for custom runtimes. You are responsible for ensuring that you are licensed to use any custom runtimes that you add, and for correctly configuring and maintaining them.

Prerequisites

You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
You have built your custom runtime and added the image to a container image repository such as Quay.

Procedure

From the OpenShift AI dashboard, click Settings > Serving runtimes.
The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
To add a custom runtime, choose one of the following options:
- To start with an existing runtime (for example, TGIS Standalone ServingRuntime for KServe), click the action menu (⋮) next to the existing runtime and then click Duplicate.
- To add a new custom runtime, click Add serving runtime.
In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
In the Select the API protocol this runtime supports list, select REST or gRPC.
Optional: If you started a new runtime (rather than duplicating an existing one), add your code by choosing one of the following options:
- Upload a YAML file
  1. Click Upload files.
  2. In the file browser, select a YAML file on your computer.
    The embedded YAML editor opens and shows the contents of the file that you uploaded.
- Enter YAML code directly in the editor
  1. Click Start from scratch.
  2. Enter or paste YAML code directly in the embedded editor.
Note
In many cases, creating a custom runtime will require adding new or custom parameters to the env section of the ServingRuntime specification.
Click Add.
The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the custom runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.
Optional: To edit your custom runtime, click the action menu (⋮) and select Edit.

Verification

The custom model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.

3.12.3. Adding a tested and verified model-serving runtime for the single-model serving platform

In addition to preinstalled and custom model-serving runtimes, you can also use Red Hat tested and verified model-serving runtimes such as the NVIDIA Triton Inference Server to support your needs. For more information about Red Hat tested and verified runtimes, see Tested and verified runtimes for Red Hat OpenShift AI.

You can use the Red Hat OpenShift AI dashboard to add and enable the NVIDIA Triton Inference Server runtime for the single-model serving platform. You can then choose the runtime when you deploy a model on the single-model serving platform.

Prerequisites

You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.

Procedure

From the OpenShift AI dashboard, click Settings > Serving runtimes.
The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
Click Add serving runtime.
In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
In the Select the API protocol this runtime supports list, select REST or gRPC.

Click Start from scratch.

If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-kserve-rest
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
    - args:
        - tritonserver
        - --model-store=/mnt/models
        - --grpc-port=9000
        - --http-port=8080
        - --allow-grpc=true
        - --allow-http=true
      image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
      name: kserve-container
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "1"
          memory: 2Gi
      ports:
        - containerPort: 8080
          protocol: TCP
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: tensorrt
      version: "8"
    - autoSelect: true
      name: tensorflow
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "2"
    - autoSelect: true
      name: onnx
      version: "1"
    - name: pytorch
      version: "1"
    - autoSelect: true
      name: triton
      version: "2"
    - autoSelect: true
      name: xgboost
      version: "1"
    - autoSelect: true
      name: python
      version: "1"

If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-kserve-grpc
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
    - args:
        - tritonserver
        - --model-store=/mnt/models
        - --grpc-port=9000
        - --http-port=8080
        - --allow-grpc=true
        - --allow-http=true
      image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
      name: kserve-container
      ports:
        - containerPort: 9000
          name: h2c
          protocol: TCP
      volumeMounts:
        - mountPath: /dev/shm
          name: shm
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "1"
          memory: 2Gi
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: tensorrt
      version: "8"
    - autoSelect: true
      name: tensorflow
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "2"
    - autoSelect: true
      name: onnx
      version: "1"
    - name: pytorch
      version: "1"
    - autoSelect: true
      name: triton
      version: "2"
    - autoSelect: true
      name: xgboost
      version: "1"
    - autoSelect: true
      name: python
      version: "1"
volumes:
  - emptyDir: null
    medium: Memory
    sizeLimit: 2Gi
    name: shm

In the metadata.name field, make sure that the value of the runtime you are adding does not match a runtime that you have already added).
Optional: To use a custom display name for the runtime that you are adding, add a metadata.annotations.openshift.io/display-name field and specify a value, as shown in the following example:
```
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: kserve-triton
  annotations:
    openshift.io/display-name: Triton ServingRuntime
```
Note
If you do not configure a custom display name for your runtime, OpenShift AI shows the value of the metadata.name field.
Click Create.
The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.
Optional: To edit the runtime, click the action menu (⋮) and select Edit.

Verification

The model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.

Additional resources

Tested and verified model-serving runtimes

3.12.4. Deploying models on the single-model serving platform

When you have enabled the single-model serving platform, you can enable a pre-installed or custom model-serving runtime and start to deploy models on the platform.

Note

Text Generation Inference Server (TGIS) is based on an early fork of Hugging Face TGI. Red Hat will continue to develop the standalone TGIS runtime to support TGI models. If a model does not work in the current version of OpenShift AI, support might be added in a future version. In the meantime, you can also add your own, custom runtime to support a TGI model. For more information, see Adding a custom model-serving runtime for the single-model serving platform.

Prerequisites

You have logged in to Red Hat OpenShift AI.
If you are using OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
You have installed KServe.
You have enabled the single-model serving platform.
To enable token authorization and external model routes for deployed models, you have added Authorino as an authorization provider. For more information, see Adding an authorization provider for the single-model serving platform.
You have created a data science project.
You have access to S3-compatible object storage.
For the model that you want to deploy, you know the associated folder path in your S3-compatible object storage bucket.
To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see Converting Hugging Face Hub models to Caikit format in the caikit-tgis-serving repository.
If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in OpenShift AI. If you use NVIDIA GPUs, see Enabling NVIDIA GPUs. If you use AMD GPUs, see AMD GPU integration.
To use the vLLM runtime, you have enabled GPU support in OpenShift AI and have installed and configured the Node Feature Discovery operator on your cluster. For more information, see Installing the Node Feature Discovery operator and Enabling NVIDIA GPUs
To use the vLLM ServingRuntime with Gaudi accelerators support for KServe runtime, you have enabled support for hybrid processing units (HPUs) in OpenShift AI. This includes installing the Intel Gaudi AI accelerator operator and configuring an accelerator profile. For more information, see Setting up Gaudi for OpenShift and Working with accelerators.
To use the vLLM ROCm ServingRuntime for KServe runtime, you have enabled support for AMD graphic processing units (GPUs) in OpenShift AI. This includes installing the AMD GPU operator and configuring an accelerator profile. For more information, see Deploying the AMD GPU operator on OpenShift and Working with accelerators.
Note
In OpenShift AI 2.16, Red Hat supports NVIDIA GPU, Intel Gaudi, and AMD GPU accelerators for model serving.
To deploy RHEL AI models:
- You have enabled the vLLM ServingRuntime for KServe runtime.
- You have downloaded the model from the Red Hat container registry and uploaded it to S3-compatible object storage.

Procedure

In the left menu, click Data Science Projects.
The Data Science Projects page opens.
Click the name of the project that you want to deploy a model in.
A project details page opens.
Click the Models tab.
Perform one of the following actions:
- If you see a Single-model serving platform tile, click Deploy model on the tile.
- If you do not see any tiles, click the Deploy model button.
The Deploy model dialog opens.
In the Model deployment name field, enter a unique name for the model that you are deploying.
In the Serving runtime field, select an enabled runtime.
From the Model framework (name - version) list, select a value.
In the Number of model server replicas to deploy field, specify a value.
From the Model server size list, select a value.
The following options are only available if you have enabled accelerator support on your cluster and created an accelerator profile:
1. From the Accelerator list, select an accelerator.
2. If you selected an accelerator in the preceding step, specify the number of accelerators to use in the Number of accelerators field.
Optional: In the Model route section, select the Make deployed models available through an external route checkbox to make your deployed models available to external clients.
To require token authorization for inference requests to the deployed model, perform the following actions:
1. Select Require token authorization.
2. In the Service account name field, enter the service account name that the token will be generated for.
To specify the location of your model, perform one of the following sets of actions:
- To use an existing connection
  1. Select Existing connection.
  2. From the Name list, select a connection that you previously defined.
  3. In the Path field, enter the folder path that contains the model in your specified data source.
    Important
    The OpenVINO Model Server runtime has specific requirements for how you specify the model path. For more information, see known issue RHOAIENG-3025 in the OpenShift AI release notes.
- To use a new connection
  1. To define a new connection that your model can access, select New connection.
    In the Add connection modal, select a Connection type. The S3 compatible object storage and URI options are pre-installed connection types. Additional options might be available if your OpenShift AI administrator added them.
    The Add connection form opens with fields specific to the connection type that you selected.
  2. Fill in the connection detail fields.
    Important
    If your connection type is an S3-compatible object storage, you must provide the folder path that contains your data file. The OpenVINO Model Server runtime has specific requirements for how you specify the model path. For more information, see known issue RHOAIENG-3025 in the OpenShift AI release notes.
(Optional) Customize the runtime parameters in the Configuration parameters section:
1. Modify the values in Additional serving runtime arguments to define how the deployed model behaves.
2. Modify the values in Additional environment variables to define variables in the model’s environment.
  The Configuration parameters section shows predefined serving runtime parameters, if any are available.
  Note
  Do not modify the port or model serving runtime arguments, because they require specific values to be set. Overwriting these parameters can cause the deployment to fail.
Click Deploy.

Verification

Confirm that the deployed model is shown on the Models tab for the project, and on the Model Serving page of the dashboard with a checkmark in the Status column.

3.12.5. Setting a timeout for KServe

When deploying large models or using node autoscaling with KServe, the operation may time out before a model is deployed because the default progress-deadline that KNative Serving sets is 10 minutes.

If a pod using KNative Serving takes longer than 10 minutes to deploy, the pod might be automatically marked as failed. This can happen if you are deploying large models that take longer than 10 minutes to pull from S3-compatible object storage or if you are using node autoscaling to reduce the consumption of GPU nodes.

To resolve this issue, you can set a custom progress-deadline in the KServe InferenceService for your application.

Prerequisites

You have namespace edit access for your OpenShift cluster.

Procedure

Log in to the OpenShift console as a cluster administrator.
Select the project where you have deployed the model.
In the Administrator perspective, click Home Search.
From the Resources dropdown menu, search for InferenceService.
Under spec.predictor.annotations, modify the serving.knative.dev/progress-deadline with the new timeout:
```
apiVersion: serving.kserve.io/v1alpha1
kind: InferenceService
metadata:
  name: my-inference-service
spec:
  predictor:
    annotations:
      serving.knative.dev/progress-deadline: 30m
```
Note
Ensure that you set the progress-deadline on the spec.predictor.annotations level, so that the KServe InferenceService can copy the progress-deadline back to the KNative Service object.

3.12.6. Customizing the parameters of a deployed model-serving runtime

You might need additional parameters beyond the default ones to deploy specific models or to enhance an existing model deployment. In such cases, you can modify the parameters of an existing runtime to suit your deployment needs.

Note

Customizing the parameters of a runtime only affects the selected model deployment.

Prerequisites

You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
You have deployed a model on the single-model serving platform.

Procedure

From the OpenShift AI dashboard, click Model Serving in the left menu.
The Deployed models page opens.
Click the action menu (⋮) next to the name of the model you want to customize and select Edit.
The Configuration parameters section shows predefined serving runtime parameters, if any are available.
Customize the runtime parameters in the Configuration parameters section:
1. Modify the values in Additional serving runtime arguments to define how the deployed model behaves.
2. Modify the values in Additional environment variables to define variables in the model’s environment.
  Note
  Do not modify the port or model serving runtime arguments, because they require specific values to be set. Overwriting these parameters can cause the deployment to fail.
After you are done customizing the runtime parameters, click Redeploy to save and deploy the model with your changes.

Verification

Confirm that the deployed model is shown on the Models tab for the project, and on the Model Serving page of the dashboard with a checkmark in the Status column.
Confirm that the arguments and variables that you set appear in spec.predictor.model.args and spec.predictor.model.env by one of the following methods:
- Checking the InferenceService YAML from the OpenShift Console.
- Using the following command in the OpenShift CLI:
```
oc get -o json inferenceservice <inferenceservicename/modelname> -n <projectname>
```

3.12.7. Customizable model serving runtime parameters

You can modify the parameters of an existing model serving runtime to suit your deployment needs.

For more information about parameters for each of the supported serving runtimes, see the following table:

Serving runtime	Resource
NVIDIA Triton Inference Server	NVIDIA Triton Inference Server: Model Parameters
Caikit Text Generation Inference Server (Caikit-TGIS) ServingRuntime for KServe	Caikit NLP: Configuration TGIS: Model configuration
Caikit Standalone ServingRuntime for KServe	Caikit NLP: Configuration
OpenVINO Model Server	OpenVINO Model Server Features: Dynamic Input Parameters
Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe	TGIS: Model configuration
vLLM ServingRuntime for KServe	vLLM: Engine Arguments OpenAI Compatible Server

Additional resources

Customizing the parameters of a deployed model serving runtime

3.12.8. Using OCI containers for model storage

As an alternative to storing a model in an S3 bucket or URI, you can upload models to Open Container Initiative (OCI) containers. Using OCI containers for model storage can help you:

Reduce startup times by avoiding downloading the same model multiple times.
Reduce disk space usage by reducing the number of models downloaded locally.
Improve model performance by allowing pre-fetched images.

Using OCI containers for model storage involves the following tasks:

Storing a model in an OCI image
Deploying a model from an OCI image

Important

Using OCI containers for model storage is currently available in Red Hat OpenShift AI 2.16 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

3.12.8.1. Storing a model in an OCI image

You can store a model in an OCI image. The following procedure uses the example of storing a MobileNet v2-7 model in ONNX format.

Prerequisites

You have a model in the ONNX format. The example in this procedure uses the MobileNet v2-7 model in ONNX format.
You have installed the Podman tool.

Procedure

In a terminal window on your local machine, create a temporary directory for storing both the model and the support files that you need to create the OCI image:
```
cd $(mktemp -d)
```
Create a models folder inside the temporary directory:
```
mkdir -p models/1
```
Note
This example command specifies the subdirectory 1 because OpenVINO requires numbered subdirectories for model versioning. If you are not using OpenVINO, you do not need to create the 1 subdirectory to use OCI container images.

Download the model and support files:

DOWNLOAD_URL=https://github.com/onnx/models/raw/main/validated/vision/classification/mobilenet/model/mobilenetv2-7.onnx
curl -L $DOWNLOAD_URL -O --output-dir models/1/

Use the tree command to confirm that the model files are located in the directory structure as expected:
```
tree
```
The tree command should return a directory structure similar to the following example:
```
.
├── Containerfile
└── models
    └── 1
        └── mobilenetv2-7.onnx
```
Create a Docker file named Containerfile:
Note
- Specify a base image that provides a shell. In the following example, ubi9-micro is the base container image. You cannot specify an empty image that does not provide a shell, such as scratch, because KServe uses the shell to ensure the model files are accessible to the model server.
- Change the ownership of the copied model files and grant read permissions to the root group to ensure that the model server can access the files. OpenShift runs containers with a random user ID and the root group ID.
```
FROM registry.access.redhat.com/ubi9/ubi-micro:latest
COPY --chown=0:0 models /models
RUN chmod -R a=rX /models

# nobody user
USER 65534
```
Use podman build commands to create the OCI container image and upload it to a registry. The following commands use Quay as the registry.
Note
If your repository is private, ensure that you are authenticated to the registry before uploading your container image.
```
podman build --format=oci -t quay.io/<user_name>/<repository_name>:<tag_name> .
podman push quay.io/<user_name>/<repository_name>:<tag_name>
```

3.12.8.2. Deploying a model stored in an OCI image

You can deploy a model that is stored in an OCI image.

The following procedure uses the example of deploying a MobileNet v2-7 model in ONNX format, stored in an OCI image on an OpenVINO model server.

Note

By default in KServe, models are exposed outside the cluster and not protected with authorization.

Prerequisites

You have stored a model in an OCI image as described in Storing a model in an OCI image.
If you want to deploy a model that is stored in a private OCI repository, you must configure an image pull secret. For more information about creating an image pull secret, see Using image pull secrets.
You are logged in to your OpenShift cluster.

Procedure

Create a project to deploy the model:
```
oc new-project oci-model-example
```
Use the OpenShift AI Applications project kserve-ovms template to create a ServingRuntime resource and configure the OpenVINO model server in the new project:
```
oc process -n redhat-ods-applications -o yaml kserve-ovms | oc apply -f -
```

Verify that the ServingRuntime named kserve-ovms is created:

oc get servingruntimes

The command should return output similar to the following:

NAME          DISABLED   MODELTYPE     CONTAINERS         AGE
kserve-ovms              openvino_ir   kserve-container   1m

Create an InferenceService YAML resource, depending on whether the model is stored from a private or a public OCI repository:

For a model stored in a public OCI repository, create an InferenceService YAML file with the following values, replacing <user_name>, <repository_name>, and <tag_name> with values specific to your environment:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sample-isvc-using-oci
spec:
  predictor:
    model:
      runtime: kserve-ovms # Ensure this matches the name of the ServingRuntime resource
      modelFormat:
        name: onnx
      storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>
      resources:
        requests:
          memory: 500Mi
          cpu: 100m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
        limits:
          memory: 4Gi
          cpu: 500m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it

For a model stored in a private OCI repository, create an InferenceService YAML file that specifies your pull secret in the spec.predictor.imagePullSecrets field, as shown in the following example:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sample-isvc-using-private-oci
spec:
  predictor:
    model:
      runtime: kserve-ovms # Ensure this matches the name of the ServingRuntime resource
      modelFormat:
        name: onnx
      storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>
      resources:
        requests:
          memory: 500Mi
          cpu: 100m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
        limits:
          memory: 4Gi
          cpu: 500m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
    imagePullSecrets: # Specify image pull secrets to use for fetching container images, including OCI model images
    - name: <pull-secret-name>

After you create the InferenceService resource, KServe deploys the model stored in the OCI image referred to by the storageUri field.

Verification

Check the status of the deployment:

oc get inferenceservice

The command should return output that includes information, such as the URL of the deployed model and its readiness state.

3.12.9. Using accelerators with vLLM

OpenShift AI includes support for NVIDIA, AMD and Intel Gaudi accelerators. OpenShift AI also includes preinstalled model-serving runtimes that provide accelerator support.

3.12.9.1. NVIDIA GPUs

You can serve models with NVIDIA graphics processing units (GPUs) by using the vLLM ServingRuntime for KServe runtime. To use the runtime, you must enable GPU support in OpenShift AI. This includes installing and configuring the Node Feature Discovery operator on your cluster. For more information, see Installing the Node Feature Discovery operator and Enabling NVIDIA GPUs.

3.12.9.2. Intel Gaudi accelerators

You can serve models with Intel Gaudi accelerators by using the vLLM ServingRuntime with Gaudi accelerators support for KServe runtime. To use the runtime, you must enable hybrid processing support (HPU) support in OpenShift AI. This includes installing the Intel Gaudi AI accelerator operator and configuring an accelerator profile. For more information, see Setting up Gaudi for OpenShift and Working with accelerator profiles.

For information about recommended vLLM parameters, environment variables, supported configurations and more, see vLLM with Intel® Gaudi® AI Accelerators.

3.12.9.3. AMD GPUs

You can serve models with AMD GPUs by using the vLLM ROCm ServingRuntime for KServe runtime. To use the runtime, you must enable support for AMD graphic processing units (GPUs) in OpenShift AI. This includes installing the AMD GPU operator and configuring an accelerator profile. For more information, see Deploying the AMD GPU operator on OpenShift and Working with accelerator profiles.

Additional resources

Supported model-serving runtimes

3.12.10. Customizing the vLLM model-serving runtime

In certain cases, you may need to add additional flags or environment variables to the vLLM ServingRuntime for KServe runtime to deploy a family of LLMs.

The following procedure describes customizing the vLLM model-serving runtime to deploy a Llama, Granite or Mistral model.

Prerequisites

You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
For Llama model deployment, you have downloaded a meta-llama-3 model to your object storage.
For Granite model deployment, you have downloaded a granite-7b-instruct or granite-20B-code-instruct model to your object storage.
For Mistral model deployment, you have downloaded a mistral-7B-Instruct-v0.3 model to your object storage.
You have enabled the vLLM ServingRuntime for KServe runtime.
You have enabled GPU support in OpenShift AI and have installed and configured the Node Feature Discovery operator on your cluster. For more information, see Installing the Node Feature Discovery operator and Enabling NVIDIA GPUs

Procedure

Follow the steps to deploy a model as described in Deploying models on the single-model serving platform.
In the Serving runtime field, select vLLM ServingRuntime for KServe.
If you are deploying a meta-llama-3 model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
–-distributed-executor-backend=mp 1
--max-model-len=6144 2
```
1
Sets the backend to multiprocessing for distributed model workers
2
Sets the maximum context length of the model to 6144 tokens
If you are deploying a granite-7B-instruct model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
--distributed-executor-backend=mp 1
```
1
Sets the backend to multiprocessing for distributed model workers
If you are deploying a granite-20B-code-instruct model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
--distributed-executor-backend=mp 1
–-tensor-parallel-size=4 2
--max-model-len=6448 3
```
1
Sets the backend to multiprocessing for distributed model workers
2
Distributes inference across 4 GPUs in a single node
3
Sets the maximum context length of the model to 6448 tokens
If you are deploying a mistral-7B-Instruct-v0.3 model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
--distributed-executor-backend=mp 1
--max-model-len=15344 2
```
1
Sets the backend to multiprocessing for distributed model workers
2
Sets the maximum context length of the model to 15344 tokens
Click Deploy.

Verification

Confirm that the deployed model is shown on the Models tab for the project, and on the Model Serving page of the dashboard with a checkmark in the Status column.

For granite models, use the following example command to verify API requests to your deployed model:

curl -q -X 'POST' \
    "https://<inference_endpoint_url>:443/v1/chat/completions" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d "{
    \"model\": \"<model_name>\",
    \"prompt\": \"<prompt>",
    \"max_tokens\": <max_tokens>,
    \"temperature\": <temperature>
    }"

Additional resources

vLLM: Engine Arguments

3.13. Deploying models by using multiple GPU nodes

Deploy models across multiple GPU nodes to handle large models, such as large language models (LLMs).

This procedure shows you how to serve models on Red Hat OpenShift AI across multiple GPU nodes using the vLLM serving framework. Multi-node inferencing uses the vllm-multinode-runtime custom runtime. The vllm-multinode-runtime runtime uses the same image as the VLLM ServingRuntime for KServe runtime and also includes information necessary for multi-GPU inferencing.

Important

Deploying models by using multiple GPU nodes is currently available in Red Hat OpenShift AI as a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope

Prerequisites

You have cluster administrator privileges for your OpenShift cluster.
You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.
You have enabled the operators for your GPU type, such as Node Feature Discovery Operator, NVIDIA GPU Operator. For more information about enabling accelerators, see Enabling accelerators.
- You are using an NVIDIA GPU (nvidia.com/gpu).
- You have specified the GPU type through either the ServingRuntime or InferenceService. If the GPU type specified in the ServingRuntime differs from what is set in the InferenceService, both GPU types are assigned to the resource and can cause errors.
You have enabled KServe on your cluster.
You have only one head pod in your setup. Do not adjust the replica count using the min_replicas or max_replicas settings in the InferenceService. Creating additional head pods can cause them to be excluded from the Ray cluster.
You have a persistent volume claim (PVC) set up and configured for ReadWriteMany (RWX) access mode.

Procedure

In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:
```
$ oc login <openshift_cluster_url> -u <admin_username> -p <password>
```
Select or create a namespace for deploying the model. For example, you can create the kserve-demo namespace by running the following command:
```
oc new-project kserve-demo
```

In the target namespace, create a PVC for model storage and specify the name of your storage class. The storage class must be file storage.

Note

If you have already configured a PVC, you can skip this step.

kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: granite-8b-code-base-pvc
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 50Gi
  storageClassName: __<fileStorageClassName>__

Download the model to the PVC. For example:

apiVersion: v1
kind: Pod
metadata:
  name: download-granite-8b-code
  labels:
    name: download-granite-8b-code
spec:
  volumes:
    - name: model-volume
      persistentVolumeClaim:
        claimName: granite-8b-code-claim
  restartPolicy: Never
  initContainers:
    - name: fix-volume-permissions
      image: quay.io/quay/busybox@sha256:xxxxx
      command: ["sh"]
      args: ["-c", "mkdir -p /mnt/models/granite-8b-code-base && chmod -R 777 /mnt/models"]
      volumeMounts:
        - mountPath: "/mnt/models/"
          name: model-volume
  containers:
    - resources:
        requests:
          memory: 40Gi
      name: download-model
      imagePullPolicy: IfNotPresent
      image: quay.io/modh/kserve-storage-initializer@sha256:xxxxx
      args:
        - 's3://$<bucket_name>/granite-8b-code-base/'
        - /mnt/models/granite-8b-code-base
      env:
        - name: AWS_ACCESS_KEY_ID
          value: <id>
        - name: AWS_SECRET_ACCESS_KEY
          value: <secret>
        - name: BUCKET_NAME
          value: <bucket_name>
        - name: S3_USE_HTTPS
          value: "1"
        - name: AWS_ENDPOINT_URL
          value: <AWS endpoint>
        - name: awsAnonymousCredential
          value: 'false'
        - name: AWS_DEFAULT_REGION
          value: <region>
      volumeMounts:
        - mountPath: "/mnt/models/"
          name: model-volume

Create the vllm-multinode-runtime custom runtime:

oc process vllm-multinode-runtime-template -n redhat-ods-applications|oc apply -n kserve-demo -f -

Deploy the model using the following InferenceService configuration:
```
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/autoscalerClass: external
  name: granite-8b-code-base-pvc
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-multinode-runtime
      storageUri: pvc://granite-8b-code-base-pvc/granite-8b-code-base
    workerSpec: {}
```
The following configuration can be added to the InferenceService:
- workerSpec.tensorParallelSize: Determines how many GPUs are used per node. The GPU type count in both the head and worker node deployment resources is updated automatically. Ensure that the value of workerSpec.tensorParallelSize is at least 1.
- workerSpec.pipelineParallelSize: Determines how many nodes are involved in the deployment. This variable represents the total number of nodes, including both the head and worker nodes. Ensure that the value of workerSpec.pipelineParallelSize is at least 2.

Verification

To confirm that you have set up your environment to deploy models on multiple GPU nodes, check the GPU resource status, the InferenceService status, the ray cluster status, and send a request to the model.

Check the GPU resource status:

Retrieve the pod names for the head and worker nodes:

# Get pod name
podName=$(oc get pod -l app=isvc.granite-8b-code-base-pvc-predictor --no-headers|cut -d' ' -f1)
workerPodName=$(oc get pod -l app=isvc.granite-8b-code-base-pvc-predictor-worker --no-headers|cut -d' ' -f1)

oc wait --for=condition=ready pod/${podName} --timeout=300s
# Check the GPU memory size for both the head and worker pods:
echo "### HEAD NODE GPU Memory Size"
kubectl exec $podName -- nvidia-smi
echo "### Worker NODE GPU Memory Size"
kubectl exec $workerPodName -- nvidia-smi

Sample response

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   33C    P0             71W /  300W |19031MiB /  23028MiB <1>|      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
         ...
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   30C    P0             69W /  300W |18959MiB /  23028MiB <2>|      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Confirm that the model loaded properly by checking the values of <1> and <2>. If the model did not load, the value of these fields is 0MiB.

Verify the status of your InferenceService using the following command:

Note

In the Technology Preview, you can only use port forwarding for inferencing.

oc wait --for=condition=ready pod/${podName} -n $DEMO_NAMESPACE --timeout=300s
export MODEL_NAME=granite-8b-code-base-pvc

Sample response

   NAME                 URL                                                   READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                          AGE
   granite-8b-code-base-pvc   http://granite-8b-code-base-pvc.default.example.com

Send a request to the model to confirm that the model is available for inference:

oc wait --for=condition=ready pod/${podName} -n vllm-multinode --timeout=300s

oc port-forward $podName 8080:8080 &

curl http://localhost:8080/v1/completions \
       -H "Content-Type: application/json" \
       -d "{
            'model': "$MODEL_NAME",
            'prompt': 'At what temperature does Nitrogen boil?',
            'max_tokens': 100,
            'temperature': 0
        }"

3.14. Making inference requests to models deployed on the single-model serving platform

When you deploy a model by using the single-model serving platform, the model is available as a service that you can access using API requests. This enables you to return predictions based on data inputs. To use API requests to interact with your deployed model, you must know the inference endpoint for the model.

In addition, if you secured your inference endpoint by enabling token authorization, you must know how to access your authorization token so that you can specify this in your inference requests.

3.14.1. Accessing the authorization token for a deployed model

If you secured your model inference endpoint by enabling token authorization, you must know how to access your authorization token so that you can specify it in your inference requests.

Prerequisites

You have logged in to Red Hat OpenShift AI.
If you are using OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
You have deployed a model by using the single-model serving platform.

Procedure

From the OpenShift AI dashboard, click Data Science Projects.
The Data Science Projects page opens.
Click the name of the project that contains your deployed model.
A project details page opens.
Click the Models tab.
In the Models and model servers list, expand the section for your model.
Your authorization token is shown in the Token authorization section, in the Token secret field.
Optional: To copy the authorization token for use in an inference request, click the Copy button ( ) next to the token value.

3.14.2. Accessing the inference endpoint for a deployed model

To make inference requests to your deployed model, you must know how to access the inference endpoint that is available.

For a list of paths to use with the supported runtimes and example commands, see Inference endpoints.

Prerequisites

You have logged in to Red Hat OpenShift AI.
If you are using OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
You have deployed a model by using the single-model serving platform.
If you enabled token authorization for your deployed model, you have the associated token value.

Procedure

From the OpenShift AI dashboard, click Model Serving.
The inference endpoint for the model is shown in the Inference endpoint field.
Depending on what action you want to perform with the model (and if the model supports that action), copy the inference endpoint and then add a path to the end of the URL.
Use the endpoint to make API requests to your deployed model.

Additional resources

3.15. Configuring monitoring for the single-model serving platform

The single-model serving platform includes metrics for supported runtimes of the KServe component. KServe does not generate its own metrics and relies on the underlying model-serving runtimes to provide them. The set of available metrics for a deployed model depends on its model-serving runtime.

In addition to runtime metrics for KServe, you can also configure monitoring for OpenShift Service Mesh. The OpenShift Service Mesh metrics help you to understand dependencies and traffic flow between components in the mesh.

Prerequisites

You have cluster administrator privileges for your OpenShift cluster.
You have created OpenShift Service Mesh and Knative Serving instances and installed KServe.
You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.
You are familiar with creating a config map for monitoring a user-defined workflow. You will perform similar steps in this procedure.
You are familiar with enabling monitoring for user-defined projects in OpenShift. You will perform similar steps in this procedure.
You have assigned the monitoring-rules-view role to users that will monitor metrics.

Procedure

In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:
```
$ oc login <openshift_cluster_url> -u <admin_username> -p <password>
```
Define a ConfigMap object in a YAML file called uwm-cm-conf.yaml with the following contents:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      logLevel: debug
      retention: 15d
```
The user-workload-monitoring-config object configures the components that monitor user-defined projects. Observe that the retention time is set to the recommended value of 15 days.
Apply the configuration to create the user-workload-monitoring-config object.
```
$ oc apply -f uwm-cm-conf.yaml
```
Define another ConfigMap object in a YAML file called uwm-cm-enable.yaml with the following contents:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```
The cluster-monitoring-config object enables monitoring for user-defined projects.
Apply the configuration to create the cluster-monitoring-config object.
```
$ oc apply -f uwm-cm-enable.yaml
```

Create ServiceMonitor and PodMonitor objects to monitor metrics in the service mesh control plane as follows:

Create an istiod-monitor.yaml YAML file with the following contents:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istiod-monitor
  namespace: istio-system
spec:
  targetLabels:
  - app
  selector:
    matchLabels:
      istio: pilot
  endpoints:
  - port: http-monitoring
    interval: 30s

Deploy the ServiceMonitor CR in the specified istio-system namespace.
```
$ oc apply -f istiod-monitor.yaml
```
You see the following output:
```
servicemonitor.monitoring.coreos.com/istiod-monitor created
```

Create an istio-proxies-monitor.yaml YAML file with the following contents:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: istio-proxies-monitor
  namespace: istio-system
spec:
  selector:
    matchExpressions:
    - key: istio-prometheus-ignore
      operator: DoesNotExist
  podMetricsEndpoints:
  - path: /stats/prometheus
    interval: 30s

Deploy the PodMonitor CR in the specified istio-system namespace.

$ oc apply -f istio-proxies-monitor.yaml

You see the following output:

podmonitor.monitoring.coreos.com/istio-proxies-monitor created

3.16. Viewing model-serving runtime metrics for the single-model serving platform

When a cluster administrator has configured monitoring for the single-model serving platform, non-admin users can use the OpenShift web console to view model-serving runtime metrics for the KServe component.

Prerequisites

A cluster administrator has configured monitoring for the single-model serving platform.
You have been assigned the monitoring-rules-view role. For more information, see Granting users permission to configure monitoring for user-defined projects.
You are familiar with how to monitor project metrics in the OpenShift web console. For more information, see Monitoring your project metrics.

Procedure

Log in to the OpenShift web console.
Switch to the Developer perspective.
In the left menu, click Observe.
As described in Monitoring your project metrics, use the web console to run queries for model-serving runtime metrics. You can also run queries for metrics that are related to OpenShift Service Mesh. Some examples are shown.
1. The following query displays the number of successful inference requests over a period of time for a model deployed with the vLLM runtime:
```
sum(increase(vllm:request_success_total{namespace=${namespace},model_name=${model_name}}[${rate_interval}]))
```
  Note
  Certain vLLM metrics are available only after an inference request is processed by a deployed model. To generate and view these metrics, you must first make an inference request to the model.
2. The following query displays the number of successful inference requests over a period of time for a model deployed with the standalone TGIS runtime:
```
sum(increase(tgi_request_success{namespace=${namespace}, pod=~${model_name}-predictor-.*}[${rate_interval}]))
```
3. The following query displays the number of successful inference requests over a period of time for a model deployed with the Caikit Standalone runtime:
```
sum(increase(predict_rpc_count_total{namespace=${namespace},code=OK,model_id=${model_name}}[${rate_interval}]))
```
4. The following query displays the number of successful inference requests over a period of time for a model deployed with the OpenVINO Model Server runtime:
```
sum(increase(ovms_requests_success{namespace=${namespace},name=${model_name}}[${rate_interval}]))
```

Additional resources

3.17. Monitoring model performance

In the single-model serving platform, you can view performance metrics for a specific model that is deployed on the platform.

3.17.1. Viewing performance metrics for a deployed model

You can monitor the following metrics for a specific model that is deployed on the single-model serving platform:

Number of requests - The number of requests that have failed or succeeded for a specific model.
Average response time (ms) - The average time it takes a specific model to respond to requests.
CPU utilization (%) - The percentage of the CPU limit per model replica that is currently utilized by a specific model.
Memory utilization (%) - The percentage of the memory limit per model replica that is utilized by a specific model.

You can specify a time range and a refresh interval for these metrics to help you determine, for example, when the peak usage hours are and how the model is performing at a specified time.

Prerequisites

You have installed Red Hat OpenShift AI.
A cluster admin has enabled user workload monitoring (UWM) for user-defined projects on your OpenShift cluster. For more information, see Enabling monitoring for user-defined projects and Configuring monitoring for the single-model serving platform.
You have logged in to Red Hat OpenShift AI.
If you are using OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
The following dashboard configuration options are set to the default values as shown:
```
disablePerformanceMetrics:false
disableKServeMetrics:false
```
For more information, see Dashboard configuration options.
You have deployed a model on the single-model serving platform by using a preinstalled runtime.
Note
Metrics are only supported for models deployed by using a preinstalled model-serving runtime or a custom runtime that is duplicated from a preinstalled runtime.

Procedure

From the OpenShift AI dashboard navigation menu, click Data Science Projects.
The Data Science Projects page opens.
Click the name of the project that contains the data science models that you want to monitor.
In the project details page, click the Models tab.
Select the model that you are interested in.
On the Endpoint performance tab, set the following options:
- Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
- Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.
Scroll down to view data graphs for number of requests, average response time, CPU utilization, and memory utilization.

Verification

The Endpoint performance tab shows graphs of metrics for the model.

3.18. Optimizing model-serving runtimes

You can optionally enhance the preinstalled model-serving runtimes available in OpenShift AI to leverage additional benefits and capabilities, such as optimized inferencing, reduced latency, and fine-tuned resource allocation.

3.19. Performance optimization and tuning

3.19.1. Determining GPU requirements for LLM-powered applications

There are several factors to consider when choosing GPUs for applications powered by a Large Language Model (LLM) hosted on OpenShift AI.

The following guidelines help you determine the hardware requirements for your application, depending on the size and expected usage of your model.

Estimating memory needs: A general rule of thumb is that a model with N parameters in 16-bit precision requires approximately 2N bytes of GPU memory. For example, an 8-billion-parameter model requires around 16GB of GPU memory, while a 70-billion-parameter model requires around 140GB.
Quantization: To reduce memory requirements and potentially improve throughput, you can use quantization to load or run the model at lower-precision formats such as INT8, FP8, or INT4. This reduces the memory footprint at the expense of a slight reduction in model accuracy.
Note
The vLLM ServingRuntime for KServe model-serving runtime supports several quantization methods. For more information about supported implementations and compatible hardware, see Supported hardware for quantization kernels.
Additional memory for key-value cache: In addition to model weights, GPU memory is also needed to store the attention key-value (KV) cache, which increases with the number of requests and the sequence length of each request. This can impact performance in real-time applications, especially for larger models.
Recommended GPU configurations:
- Small Models (1B–8B parameters): For models in the range, a GPU with 24GB of memory is generally sufficient to support a small number of concurrent users.
- Medium Models (10B–34B parameters):
  - Models under 20B parameters require at least 48GB of GPU memory.
  - Models that are between 20B - 34B parameters require at least 80GB or more of memory in a single GPU.
- Large Models (70B parameters): Models in this range may need to be distributed across multiple GPUs by using tensor parallelism techniques. Tensor parallelism allows the model to span multiple GPUs, improving inter-token latency and increasing the maximum batch size by freeing up additional memory for KV cache. Tensor parallelism works best when GPUs have fast interconnects such as an NVLink.
- Very Large Models (405B parameters): For extremely large models, quantization is recommended to reduce memory demands. You can also distribute the model using pipeline parallelism across multiple GPUs, or even across two servers. This approach allows you to scale beyond the memory limitations of a single server, but requires careful management of inter-server communication for optimal performance.

For best results, start with smaller models and then scale up to larger models as required, using techniques such as parallelism and quantization to meet your performance and memory requirements.

Additional resources

Distributed serving

3.19.2. Performance considerations for text-summarization and retrieval-augmented generation (RAG) applications

There are additional factors that need to be taken into consideration for text-summarization and RAG applications, as well as for LLM-powered services that process large documents uploaded by users.

Longer Input Sequences: The input sequence length can be significantly longer than in a typical chat application, if each user query includes a large prompt or a large amount of context such as an uploaded document. The longer input sequence length increases the prefill time, the time the model takes to process the initial input sequence before generating a response, which can then lead to a higher Time-to-First-Token (TTFT). A longer TTFT may impact the responsiveness of the application. Minimize this latency for optimal user experience.
KV Cache Usage: Longer sequences require more GPU memory for the key-value (KV) cache. The KV cache stores intermediate attention data to improve model performance during generation. A high KV cache utilization per request requires a hardware setup with sufficient GPU memory. This is particularly crucial if multiple users are querying the model concurrently, as each request adds to the total memory load.
Optimal Hardware Configuration: To maintain responsiveness and avoid memory bottlenecks, select a GPU configuration with sufficient memory. For instance, instead of running an 8B model on a single 24GB GPU, deploying it on a larger GPU (e.g., 48GB or 80GB) or across multiple GPUs can improve performance by providing more memory headroom for the KV cache and reducing inter-token latency. Multi-GPU setups with tensor parallelism can also help manage memory demands and improve efficiency for larger input sequences.

In summary, to ensure optimal responsiveness and scalability for document-based applications, you must prioritize hardware with high GPU memory capacity and also consider multi-GPU configurations to handle the increased memory requirements of long input sequences and KV caching.

3.19.3. Inference performance metrics

Latency, throughput and cost per million tokens are key metrics to consider when evaluating the response generation efficiency of a model during inferencing. These metrics provide a comprehensive view of a model’s inference performance and can help balance speed, efficiency, and cost for different use cases.

3.19.3.1. Latency

Latency is critical for interactive or real-time use cases, and is measured using the following metrics:

Time-to-First-Token (TTFT): The delay in milliseconds between the initial request and the generation of the first token. This metric is important for streaming responses.
Inter-Token Latency (ITL): The time taken in milliseconds to generate each subsequent token after the first, also relevant for streaming.
Time-Per-Output-Token (TPOT): For non-streaming requests, the average time taken in milliseconds to generate each token in an output sequence.

3.19.3.2. Throughput

Throughput measures the overall efficiency of a model server and is expressed with the following metrics:

Tokens per Second (TPS): The total number of tokens generated per second across all active requests.
Requests per Second (RPS): The number of requests processed per second. RPS, like response time, is sensitive to sequence length.

3.19.3.3. Cost per million tokens

Cost per Million Tokens measures the cost-effectiveness of a model’s inference, indicating the expense incurred per million tokens generated. This metric helps to assess both the economic feasibility and scalability of deploying the model.

3.19.4. Resolving CUDA out-of-memory errors

In certain cases, depending on the model and hardware accelerator used, the TGIS memory auto-tuning algorithm might underestimate the amount of GPU memory needed to process long sequences. This miscalculation can lead to Compute Unified Architecture (CUDA) out-of-memory (OOM) error responses from the model server. In such cases, you must update or add additional parameters in the TGIS model-serving runtime, as described in the following procedure.

Prerequisites

You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.

Procedure

From the OpenShift AI dashboard, click Settings > Serving runtimes.
The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
Based on the runtime that you used to deploy your model, perform one of the following actions:
- If you used the pre-installed TGIS Standalone ServingRuntime for KServe runtime, duplicate the runtime to create a custom version and then follow the remainder of this procedure. For more information about duplicating the pre-installed TGIS runtime, see Adding a custom model-serving runtime for the single-model serving platform.
- If you were already using a custom TGIS runtime, click the action menu (⋮) next to the runtime and select Edit.
  The embedded YAML editor opens and shows the contents of the custom model-serving runtime.
Add or update the BATCH_SAFETY_MARGIN environment variable and set the value to 30. Similarly, add or update the ESTIMATE_MEMORY_BATCH_SIZE environment variable and set the value to 8.
```
spec:
  containers:
    env:
    - name: BATCH_SAFETY_MARGIN
      value: 30
    - name: ESTIMATE_MEMORY_BATCH
      value: 8
```
Note
The BATCH_SAFETY_MARGIN parameter sets a percentage of free GPU memory to hold back as a safety margin to avoid OOM conditions. The default value of BATCH_SAFETY_MARGIN is 20. The ESTIMATE_MEMORY_BATCH_SIZE parameter sets the batch size used in the memory auto-tuning algorithm. The default value of ESTIMATE_MEMORY_BATCH_SIZE is 16.
Click Update.
The Serving runtimes page opens and shows the list of runtimes that are installed. Observe that the custom model-serving runtime you updated is shown.
To redeploy the model for the parameter updates to take effect, perform the following actions:
1. From the OpenShift AI dashboard, click Model Serving > Deployed Models.
2. Find the model you want to redeploy, click the action menu (⋮) next to the model, and select Delete.
3. Redeploy the model as described in Deploying models on the single-model serving platform.

Verification

You receive successful responses from the model server and no longer see CUDA OOM errors.

3.20. About the NVIDIA NIM model serving platform

You can deploy models using NVIDIA NIM inference services on the NVIDIA NIM model serving platform.

NVIDIA NIM, part of NVIDIA AI Enterprise, is a set of microservices designed for secure, reliable deployment of high performance AI model inferencing across clouds, data centers and workstations.

Additional resources

NVIDIA NIM

3.20.1. Enabling the NVIDIA NIM model serving platform

As an administrator, you can use the Red Hat OpenShift AI dashboard to enable the NVIDIA NIM model serving platform.

Note

If you previously enabled the NVIDIA NIM model serving platform in OpenShift AI 2.14 or 2.15, and then upgraded to a newer version, re-enter your NVIDIA NGC API key to re-enable the NVIDIA NIM model serving platform.

Prerequisites

You have logged in to Red Hat OpenShift AI as an administrator.
You have enabled the single-model serving platform. You do not need to enable a preinstalled runtime. For more information about enabling the single-model serving platform, see Enabling the single-model serving platform.
The following OpenShift AI dashboard configuration is enabled.
```
disableNIMModelServing: false
```
For more information, see Dashboard configuration options.
You have enabled GPU support in OpenShift AI. For more information, see Enabling NVIDIA GPUs.
You have an NVIDIA Cloud Account (NCA) and can access the NVIDIA GPU Cloud (NGC) portal. For more information, see NVIDIA GPU Cloud user guide.
Your NCA account is associated with the NVIDIA AI Enterprise Viewer role.
You have generated an NGC API key on the NGC portal. For more information, see NGC API keys.

Procedure

Log in to OpenShift AI.
In the left menu of the OpenShift AI dashboard, click Applications Explore.
On the Explore page, find the NVIDIA NIM tile.
Click Enable on the application tile.
Enter the NGC API key and then click Submit.

Verification

The NVIDIA NIM application that you enabled appears on the Enabled page.

3.20.2. Deploying models on the NVIDIA NIM model serving platform

When you have enabled the NVIDIA NIM model serving platform, you can start to deploy NVIDIA-optimized models on the platform.

Prerequisites

You have logged in to Red Hat OpenShift AI.
If you are using OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
You have enabled the NVIDIA NIM model serving platform.
You have created a data science project.
You have enabled support for graphic processing units (GPUs) in OpenShift AI. This includes installing the Node Feature Discovery operator and NVIDIA GPU Operators. For more information, see Installing the Node Feature Discovery operator and Enabling NVIDIA GPUs.

Procedure

In the left menu, click Data Science Projects.
The Data Science Projects page opens.
Click the name of the project that you want to deploy a model in.
A project details page opens.
Click the Models tab.
In the Models section, perform one of the following actions:
- On the NVIDIA NIM model serving platform tile, click Select NVIDIA NIM on the tile, and then click Deploy model.
- If you have previously selected the NVIDIA NIM model serving type, the Models page displays NVIDIA model serving enabled on the upper-right corner, along with the Deploy model button. To proceed, click Deploy model.
The Deploy model dialog opens.
Configure properties for deploying your model as follows:
1. In the Model deployment name field, enter a unique name for the deployment.
2. From the NVIDIA NIM list, select the NVIDIA NIM model that you want to deploy. For more information, see Supported Models
3. In the NVIDIA NIM storage size field, specify the size of the cluster storage instance that will be created to store the NVIDIA NIM model.
4. In the Number of model server replicas to deploy field, specify a value.
5. From the Model server size list, select a value.
6. From the Accelerator list, select an accelerator.
  The Number of accelerators field appears.
7. In the Number of accelerators field, specify the number of accelerators to use. The default value is 1.
Click Deploy.

Verification

Confirm that the deployed model is shown on the Models tab for the project, and on the Model Serving page of the dashboard with a checkmark in the Status column.

Additional resources

3.1. About the single-model serving platform

3.2. Components

3.3. Installation options

3.4. Authorization

3.5. Monitoring

3.6. Model-serving runtimes

3.6.1. ServingRuntime

3.6.2. InferenceService

3.7. Supported model-serving runtimes

3.8. Tested and verified model-serving runtimes

3.9. Inference endpoints

3.9.1. Caikit TGIS ServingRuntime for KServe

3.9.2. Caikit Standalone ServingRuntime for KServe

3.9.3. TGIS Standalone ServingRuntime for KServe

3.9.4. OpenVINO Model Server

3.9.5. vLLM ServingRuntime for KServe

3.9.6. vLLM ServingRuntime with Gaudi accelerators support for KServe

3.9.7. vLLM ROCm ServingRuntime for KServe

3.9.8. NVIDIA Triton Inference Server

3.9.9. Additional resources

3.10. About KServe deployment modes

3.10.1. Serverless mode

3.10.2. Raw deployment mode

3.11. Deploying models on single node OpenShift using KServe raw deployment mode

3.12. Deploying models by using the single-model serving platform

3.12.1. Enabling the single-model serving platform

3.12.2. Adding a custom model-serving runtime for the single-model serving platform

3.12.3. Adding a tested and verified model-serving runtime for the single-model serving platform

3.12.4. Deploying models on the single-model serving platform

3.12.5. Setting a timeout for KServe

3.12.6. Customizing the parameters of a deployed model-serving runtime

3.12.7. Customizable model serving runtime parameters

3.12.8. Using OCI containers for model storage

3.12.8.1. Storing a model in an OCI image

3.12.8.2. Deploying a model stored in an OCI image

3.12.9. Using accelerators with vLLM

3.12.9.1. NVIDIA GPUs

3.12.9.2. Intel Gaudi accelerators

3.12.9.3. AMD GPUs

3.12.10. Customizing the vLLM model-serving runtime

3.13. Deploying models by using multiple GPU nodes

3.14. Making inference requests to models deployed on the single-model serving platform

3.14.1. Accessing the authorization token for a deployed model

3.14.2. Accessing the inference endpoint for a deployed model

3.15. Configuring monitoring for the single-model serving platform

3.16. Viewing model-serving runtime metrics for the single-model serving platform

3.17. Monitoring model performance

3.17.1. Viewing performance metrics for a deployed model

3.18. Optimizing model-serving runtimes

3.18.1. Enabling speculative decoding and multi-modal inferencing

3.19. Performance optimization and tuning

3.19.1. Determining GPU requirements for LLM-powered applications

3.19.2. Performance considerations for text-summarization and retrieval-augmented generation (RAG) applications

3.19.3. Inference performance metrics

3.19.3.1. Latency

3.19.3.2. Throughput

3.19.3.3. Cost per million tokens

3.19.4. Resolving CUDA out-of-memory errors

3.20. About the NVIDIA NIM model serving platform

3.20.1. Enabling the NVIDIA NIM model serving platform

3.20.2. Deploying models on the NVIDIA NIM model serving platform

Learn

Try, buy, & sell

Communities

About Red Hat Documentation

Making open source more inclusive

About Red Hat

Red Hat legal and privacy links

Red Hat legal and privacy links