Chapter 2. Deploying models on the single-model serving platform
The single-model serving platform deploys each model on its own dedicated model server. This architecture is ideal for deploying, monitoring, scaling, and maintaining large models that require more resources, such as large language models (LLMs).
The platform is based on the KServe component and offers two deployment modes:
- KServe RawDeployment: Uses a standard deployment method that does not require serverless dependencies.
- Knative Serverless: Uses Red Hat OpenShift Serverless for deployments that can automatically scale based on demand.
2.1. About KServe deployment modes
KServe offers two deployment modes for serving models. The default mode, Knative Serverless, is based on the open-source Knative project and provides powerful autoscaling capabilities. It integrates with Red Hat OpenShift Serverless and Red Hat OpenShift Service Mesh. Alternatively, the KServe RawDeployment mode offers a more traditional deployment method with fewer dependencies.
Before you choose an option, understand how your initial configuration affects future deployments:
- If you configure for Knative Serverless: You can use both Knative Serverless and KServe RawDeployment modes.
- If you configure for KServe RawDeployment only: You can only use the KServe RawDeployment mode.
Use the following comparison to choose the option that best fits your requirements.
| Criterion | Knative Serverless | KServe RawDeployment |
|---|---|---|
| Default mode | Yes | No |
| Recommended use case | Most workloads. | Custom serving setups or models that must remain active. |
| Autoscaling | Scales automatically based on request demand, including scaling to zero when there is no traffic. | Does not scale to zero; you manage scaling with standard Kubernetes mechanisms, such as the Horizontal Pod Autoscaler. |
| Dependencies | Requires Red Hat OpenShift Serverless and Red Hat OpenShift Service Mesh. | None; uses standard Kubernetes resources such as Deployment, Service, and Horizontal Pod Autoscaler. |
| Configuration flexibility | Has some customization limitations inherited from Knative compared to raw Kubernetes deployments. | Provides full control over pod specifications because it uses standard Kubernetes Deployment resources. |
| Resource footprint | Larger, due to the additional dependencies required for serverless functionality. | Smaller. |
| Setup complexity | Might require additional configuration in setup and management. If Serverless is not already installed on the cluster, you must install and configure it. | Requires a simpler setup with fewer dependencies. |
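When you deploy a model by creating an InferenceService resource directly, rather than through the dashboard, the deployment mode is selected with the serving.kserve.io/deploymentMode annotation. The following minimal example illustrates the annotation only; the runtime name, model format, and storage location are placeholders that you must replace with values from your environment:

  apiVersion: serving.kserve.io/v1beta1
  kind: InferenceService
  metadata:
    name: example-model
    annotations:
      serving.kserve.io/deploymentMode: RawDeployment    # omit or set to Serverless to use Knative Serverless
  spec:
    predictor:
      model:
        modelFormat:
          name: onnx                                     # placeholder model format
        runtime: kserve-ovms                             # placeholder runtime name
        storageUri: s3://<bucket_name>/<model_path>      # placeholder model location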
2.2. Deploying models on the single-model serving platform
When you have enabled the single-model serving platform, you can enable a preinstalled or custom model-serving runtime and deploy models on the platform.
You can use preinstalled model-serving runtimes to start serving models without modifying or defining the runtime yourself. For help adding a custom runtime, see Adding a custom model-serving runtime for the single-model serving platform.
To successfully deploy a model, you must meet the following prerequisites.
General prerequisites
- You have logged in to Red Hat OpenShift AI.
- You have installed KServe and enabled the single-model serving platform.
- (Knative Serverless deployments only) To enable token authentication and external model routes for deployed models, you have added Authorino as an authorization provider. For more information, see Adding an authorization provider for the single-model serving platform.
- You have created a data science project.
- You have access to S3-compatible object storage, a URI-based repository, an OCI-compliant registry, or a persistent volume claim (PVC), and you have added a connection to your data science project. For more information about adding a connection, see Adding a connection to your data science project.
- If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in OpenShift AI. If you use NVIDIA GPUs, see Enabling NVIDIA GPUs. If you use AMD GPUs, see AMD GPU integration.
Runtime-specific prerequisites
Meet the requirements for the specific runtime you intend to use.
Caikit-TGIS runtime
- To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see Converting Hugging Face Hub models to Caikit format in the caikit-tgis-serving repository.
vLLM NVIDIA GPU ServingRuntime for KServe
- To use the vLLM NVIDIA GPU ServingRuntime for KServe runtime, you have enabled GPU support in OpenShift AI and have installed and configured the Node Feature Discovery Operator on your cluster. For more information, see Installing the Node Feature Discovery Operator and Enabling NVIDIA GPUs.
vLLM CPU ServingRuntime for KServe
- To serve models on IBM Z and IBM Power, use the vLLM CPU ServingRuntime for KServe. You cannot use GPU accelerators with IBM Z and IBM Power architectures. For more information, see Red Hat OpenShift Multi Architecture Component Availability Matrix.
vLLM Intel Gaudi Accelerator ServingRuntime for KServe
- To use the vLLM Intel Gaudi Accelerator ServingRuntime for KServe runtime, you have enabled support for hybrid processing units (HPUs) in OpenShift AI. This includes installing the Intel Gaudi Base Operator and configuring a hardware profile. For more information, see Intel Gaudi Base Operator OpenShift installation in the Intel Gaudi documentation and Working with hardware profiles.
vLLM AMD GPU ServingRuntime for KServe
- To use the vLLM AMD GPU ServingRuntime for KServe runtime, you have enabled support for AMD graphic processing units (GPUs) in OpenShift AI. This includes installing the AMD GPU operator and configuring a hardware profile. For more information, see Deploying the AMD GPU operator on OpenShift and Working with hardware profiles.
vLLM Spyre AI Accelerator ServingRuntime for KServe
Important: Support for IBM Spyre AI Accelerators on x86 is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
- To use the vLLM Spyre AI Accelerator ServingRuntime for KServe runtime on x86, you have installed the Spyre Operator and configured a hardware profile. For more information, see Spyre operator image and Working with hardware profiles.
Procedure
- In the left menu, click Data science projects.
- Click the name of the project that you want to deploy a model in.
  A project details page opens.
- Click the Models tab.
- Click Select single-model to deploy your model using single-model serving.
- Click the Deploy model button.
  The Deploy model dialog opens.
- In the Model deployment name field, enter a unique name for the model that you are deploying.
- In the Serving runtime field, select an enabled runtime. If project-scoped runtimes exist, the Serving runtime list includes subheadings to distinguish between global runtimes and project-scoped runtimes.
- From the Model framework (name - version) list, select a value if applicable.
- From the Deployment mode list, select KServe RawDeployment or Knative Serverless. For more information about deployment modes, see About KServe deployment modes.
- In the Number of model server replicas to deploy field, specify a value.
The following options are only available if you have created a hardware profile:
- From the Hardware profile list, select a hardware profile. If project-scoped hardware profiles exist, the Hardware profile list includes subheadings to distinguish between global hardware profiles and project-scoped hardware profiles.
  Important: By default, hardware profiles are hidden in the dashboard navigation menu and user interface, while accelerator profiles remain visible. In addition, user interface components associated with the deprecated accelerator profiles functionality are still displayed. If you enable hardware profiles, the Hardware profiles list is displayed instead of the Accelerator profiles list. To show the Settings → Hardware profiles option in the dashboard navigation menu, and the user interface components associated with hardware profiles, set the disableHardwareProfiles value to false in the OdhDashboardConfig custom resource (CR) in OpenShift. For more information about setting dashboard configuration options, see Customizing the dashboard.
- Optional: The hardware profile specifies the number of CPUs and the amount of memory allocated to the container, setting the guaranteed minimum (request) and maximum (limit) for both. To change these default values, click Customize resource requests and limit and enter new minimum (request) and maximum (limit) values.
- Optional: In the Model route section, select the Make deployed models available through an external route checkbox to make your deployed models available to external clients.
- Optional: To require token authentication for inference requests to the deployed model, perform the following actions. An example authenticated inference request is shown after the verification steps.
- Select Require token authentication.
- In the Service account name field, enter the service account name that the token will be generated for.
- To add an additional service account, click Add a service account and enter another service account name.
- To specify the location of your model, select a Connection type that you have added. The OCI-compliant registry, S3-compatible object storage, and URI options are preinstalled connection types. Additional options might be available if your OpenShift AI administrator added them.
For S3-compatible object storage: In the Path field, enter the folder path that contains the model in your specified data source.
Important: The OpenVINO Model Server runtime has specific requirements for how you specify the model path. For more information, see known issue RHOAIENG-3025 in the OpenShift AI release notes.
For Open Container Image connections: In the OCI storage location field, enter the model URI where the model is located.
Note: If you are deploying a registered model version with an existing S3, URI, or OCI data connection, some of your connection details might be autofilled. This depends on the type of data connection and the number of matching connections available in your data science project. For example, if only one matching connection exists, fields like the path, URI, endpoint, model URI, bucket, and region might populate automatically. Matching connections are labeled as Recommended.
- Complete the connection detail fields.
- Optional: If you have uploaded model files to a persistent volume claim (PVC) and the PVC is attached to your workbench, use the Existing cluster storage option to select the PVC and specify the path to the model file.
  Important: If your connection type is S3-compatible object storage, you must provide the folder path that contains your data file. The OpenVINO Model Server runtime has specific requirements for how you specify the model path. For more information, see known issue RHOAIENG-3025 in the OpenShift AI release notes.
- Optional: Customize the runtime parameters in the Configuration parameters section. The section shows predefined serving runtime parameters, if any are available.
  - Modify the values in Additional serving runtime arguments to define how the deployed model behaves.
  - Modify the values in Additional environment variables to define variables in the model’s environment.
  Note: Do not modify the port or model serving runtime arguments, because they require specific values to be set. Overwriting these parameters can cause the deployment to fail.
- Click Deploy.
Verification
- Confirm that the deployed model is shown on the Models tab for the project, and on the Model deployments page of the dashboard with a checkmark in the Status column.
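If you enabled token authentication and an external route, you can send an authenticated inference request to the deployed model to confirm that it responds. The following example is illustrative only: the host name comes from your model route, the request path and payload follow the KServe v2 REST protocol used by runtimes such as OpenVINO Model Server (other runtimes expose different paths), and $TOKEN contains the token generated for your service account:

  curl -ks https://<model_route_host>/v2/models/<model_name>/infer \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"inputs": [{"name": "<input_name>", "shape": [1, 4], "datatype": "FP32", "data": [0.1, 0.2, 0.3, 0.4]}]}'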
2.3. Deploying a model stored in an OCI image by using the CLI
You can deploy a model that is stored in an OCI image from the command line interface.
The following procedure uses the example of deploying a MobileNet v2-7 model in ONNX format, stored in an OCI image on an OpenVINO model server.
By default in KServe, models are exposed outside the cluster and not protected with authentication.
Prerequisites
- You have stored a model in an OCI image as described in Storing a model in an OCI image.
- If you want to deploy a model that is stored in a private OCI repository, you must configure an image pull secret. For more information about creating an image pull secret, see Using image pull secrets.
- You are logged in to your OpenShift cluster.
Procedure
- Create a project to deploy the model:

  oc new-project oci-model-example

- Use the kserve-ovms template from the OpenShift AI Applications project (redhat-ods-applications) to create a ServingRuntime resource and configure the OpenVINO model server in the new project:

  oc process -n redhat-ods-applications -o yaml kserve-ovms | oc apply -f -

- Verify that the ServingRuntime named kserve-ovms is created:

  oc get servingruntimes

  The command should return output similar to the following:

  NAME          DISABLED   MODELTYPE     CONTAINERS         AGE
  kserve-ovms              openvino_ir   kserve-container   1m

- Create an InferenceService YAML resource, depending on whether the model is stored in a private or a public OCI repository:
  - For a model stored in a public OCI repository, create an InferenceService YAML file, replacing <user_name>, <repository_name>, and <tag_name> with values specific to your environment. An example is shown after this procedure.
  - For a model stored in a private OCI repository, create an InferenceService YAML file that specifies your pull secret in the spec.predictor.imagePullSecrets field.

  After you create the InferenceService resource, KServe deploys the model stored in the OCI image referred to by the storageUri field.
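The following is an illustrative InferenceService manifest for the public-repository case, using the kserve-ovms runtime created earlier in this procedure. The registry host is an example; replace <user_name>, <repository_name>, and <tag_name> with your own values, and adjust the model format if your model is not ONNX:

  apiVersion: serving.kserve.io/v1beta1
  kind: InferenceService
  metadata:
    name: sample-isvc-using-oci
  spec:
    predictor:
      model:
        runtime: kserve-ovms
        modelFormat:
          name: onnx
        storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>

For a model in a private OCI repository, add the pull secret to the predictor specification, for example:

  spec:
    predictor:
      imagePullSecrets:
      - name: <pull_secret_name>
      model:
        runtime: kserve-ovms
        modelFormat:
          name: onnx
        storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>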
Verification
- Check the status of the deployment:

  oc get inferenceservice

  The command should return output that includes information, such as the URL of the deployed model and its readiness state.
2.4. Deploying models by using Distributed Inference with llm-d
Distributed Inference with llm-d is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
Distributed Inference with llm-d is a Kubernetes-native, open-source framework designed for serving large language models (LLMs) at scale. You can use Distributed Inference with llm-d to simplify the deployment of generative AI, focusing on high performance and cost-effectiveness across various hardware accelerators.
Key features of Distributed Inference with llm-d include:
- Efficiently handles large models by using optimizations such as prefix-cache aware routing and disaggregated serving.
- Integrates into a standard Kubernetes environment, using specialized components such as the Envoy proxy to handle networking and routing, and high-performance libraries such as vLLM and the NVIDIA Inference Transfer Library (NIXL).
- Reduces the complexity of deploying inference at scale through tested recipes and well-known presets, so users can focus on building applications rather than managing infrastructure.
Serving models using Distributed Inference with llm-d on Red Hat OpenShift AI consists of the following steps:
- Installing OpenShift AI.
  Note: Because KServe Serverless conflicts with the Gateway API used for Distributed Inference with llm-d, KServe Serverless is not supported on the same cluster. Instead, use KServe RawDeployment.
- Enabling the single model serving platform.
- Enabling Distributed Inference with llm-d on a Kubernetes cluster.
- Creating an LLMInferenceService Custom Resource (CR).
- Deploying a model.
This procedure describes how to create an LLMInferenceService custom resource (CR). The LLMInferenceService resource replaces the default InferenceService resource.
Prerequisites
- You have enabled the single-model serving platform.
- You have access to an OpenShift cluster running version 4.19.9 or later.
- OpenShift Service Mesh v2 is not installed in the cluster.
- You have created a GatewayClass and a Gateway named openshift-ai-inference in the openshift-ingress namespace as described in Gateway API with OpenShift Container Platform Networking.
- You have installed the LeaderWorkerSet Operator in OpenShift. For more information, see the OpenShift documentation.
Procedure
- Log in to the OpenShift console as a cluster administrator.
- Create a data science cluster initialization (DSCI) and set serviceMesh.managementState to Removed, as shown in the following example:

  serviceMesh:
    ...
    managementState: Removed

- Create a data science cluster (DSC) with the required settings in the kserve and serving components. An example DSC is shown after this procedure.
- Create the LLMInferenceService CR. An example is shown after this procedure.
- Customize the following parameters in the spec section of the inference service:
  - replicas - Specify the number of replicas.
  - model - Provide the URI to the model based on how the model is stored (uri) and the model name to use in chat completion requests (name).
    - S3 bucket: s3://<bucket-name>/<object-key>
    - Persistent volume claim (PVC): pvc://<claim-name>/<pvc-path>
    - OCI container image: oci://<registry_host>/<org_or_username>/<repository_name><tag_or_digest>
    - Hugging Face: hf://<model>/<optional-hash>
  - router - Provide an HTTPRoute and gateway, or leave blank to automatically create one.
- Save the file.
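The following sketches show one possible shape for the DSC settings and the LLMInferenceService CR described in this procedure. They are illustrative examples based on the parameters listed above; the resource names, the model URI, and the exact field layout are assumptions that you should verify against the DataScienceCluster and LLMInferenceService schemas installed on your cluster.

Example DSC settings for Distributed Inference with llm-d (KServe in RawDeployment mode, Serverless removed):

  apiVersion: datasciencecluster.opendatahub.io/v1
  kind: DataScienceCluster
  metadata:
    name: default-dsc
  spec:
    components:
      kserve:
        managementState: Managed
        defaultDeploymentMode: RawDeployment    # KServe Serverless is not supported with llm-d
        serving:
          managementState: Removed

Example LLMInferenceService CR that serves a model from Hugging Face and requests an automatically created route and gateway:

  apiVersion: serving.kserve.io/v1alpha1
  kind: LLMInferenceService
  metadata:
    name: example-llm            # placeholder name
    namespace: example-project   # placeholder project
  spec:
    replicas: 1
    model:
      uri: hf://<model>/<optional-hash>              # or s3://, pvc://, oci:// as listed above
      name: <model-name-for-chat-completion-requests>
    router:
      route: {}
      gateway: {}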
2.4.1. Example usage for Distributed Inference with llm-d
These examples show how to use Distributed Inference with llm-d in common scenarios.
Distributed Inference with llm-d is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
2.4.1.1. Single-node GPU deployment
Use single-GPU-per-replica deployment patterns for development, testing, or production deployments of smaller models, such as 7-billion-parameter models.
You can use the following examples for single-node GPU deployments:
2.4.1.2. Multi-node deployment
You can use the following examples for multi-node deployments:
2.4.1.3. Intelligent inference scheduler with KV cache routing
You can configure the scheduler to track key-value (KV) cache blocks across inference endpoints and route requests to the endpoint with the highest cache hit rate. This configuration improves throughput and reduces latency by maximizing cache reuse.
For an example, see Precise Prefix KV Cache Routing.
2.5. Monitoring models on the single-model serving platform
You can monitor models that are deployed on the single-model serving platform to view performance and resource usage metrics.
2.5.1. Viewing performance metrics for a deployed model
You can monitor the following metrics for a specific model that is deployed on the single-model serving platform:
- Number of requests - The number of requests that have failed or succeeded for a specific model.
- Average response time (ms) - The average time it takes a specific model to respond to requests.
- CPU utilization (%) - The percentage of the CPU limit per model replica that is currently utilized by a specific model.
- Memory utilization (%) - The percentage of the memory limit per model replica that is utilized by a specific model.
You can specify a time range and a refresh interval for these metrics to help you determine, for example, when the peak usage hours are and how the model is performing at a specified time.
Prerequisites
- You have installed Red Hat OpenShift AI.
- A cluster admin has enabled user workload monitoring (UWM) for user-defined projects on your OpenShift cluster. For more information, see Enabling monitoring for user-defined projects and Configuring monitoring for the single-model serving platform.
- You have logged in to Red Hat OpenShift AI.
- The following dashboard configuration options are set to the default values as shown:

  disablePerformanceMetrics: false
  disableKServeMetrics: false

  For more information about setting dashboard configuration options, see Customizing the dashboard.
- You have deployed a model on the single-model serving platform by using a preinstalled runtime.
Note: Metrics are only supported for models deployed by using a preinstalled model-serving runtime or a custom runtime that is duplicated from a preinstalled runtime.
Procedure
- From the OpenShift AI dashboard navigation menu, click Data science projects.
  The Data science projects page opens.
- Click the name of the project that contains the data science models that you want to monitor.
- In the project details page, click the Models tab.
- Select the model that you are interested in.
- On the Endpoint performance tab, set the following options:
- Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
- Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.
- Scroll down to view data graphs for number of requests, average response time, CPU utilization, and memory utilization.
Verification
The Endpoint performance tab shows graphs of metrics for the model.
2.5.2. Viewing model-serving runtime metrics for the single-model serving platform
When a cluster administrator has configured monitoring for the single-model serving platform, non-admin users can use the OpenShift web console to view model-serving runtime metrics for the KServe component.
Prerequisites
- A cluster administrator has configured monitoring for the single-model serving platform.
- You have been assigned the monitoring-rules-view role. For more information, see Granting users permission to configure monitoring for user-defined projects.
- You are familiar with how to monitor project metrics in the OpenShift web console. For more information, see Monitoring your project metrics.
Procedure
- Log in to the OpenShift web console.
- Switch to the Developer perspective.
- In the left menu, click Observe.
- As described in Monitoring your project metrics, use the web console to run queries for model-serving runtime metrics. You can also run queries for metrics that are related to OpenShift Service Mesh. Some example queries follow.

  The following query displays the number of successful inference requests over a period of time for a model deployed with the vLLM runtime:

  sum(increase(vllm:request_success_total{namespace=${namespace},model_name=${model_name}}[${rate_interval}]))

  Note: Certain vLLM metrics are available only after an inference request is processed by a deployed model. To generate and view these metrics, you must first make an inference request to the model.

  The following query displays the number of successful inference requests over a period of time for a model deployed with the standalone TGIS runtime:

  sum(increase(tgi_request_success{namespace=${namespace}, pod=~${model_name}-predictor-.*}[${rate_interval}]))

  The following query displays the number of successful inference requests over a period of time for a model deployed with the Caikit Standalone runtime:

  sum(increase(predict_rpc_count_total{namespace=${namespace},code=OK,model_id=${model_name}}[${rate_interval}]))

  The following query displays the number of successful inference requests over a period of time for a model deployed with the OpenVINO Model Server runtime:

  sum(increase(ovms_requests_success{namespace=${namespace},name=${model_name}}[${rate_interval}]))
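Certain vLLM metrics, such as vllm:request_success_total, are populated only after the model processes a request. One way to generate a request is to call the OpenAI-compatible chat completions endpoint that vLLM-based runtimes expose; the host name, model name, and token requirements in the following example depend on your deployment:

  curl -ks https://<model_route_host>/v1/chat/completions \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"model": "<model_name>", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'

After the request completes, rerun the vLLM query to confirm that the counter increases.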