Chapter 2. Deploying models

The model serving platform is based on the KServe component and deploys each model from its own dedicated model server. This architecture is ideal for deploying, monitoring, scaling, and maintaining large models that require more resources, such as large language models (LLMs).

2.1. Automatic selection of serving runtimes

When you deploy a model, OpenShift AI can automatically select the best serving runtime for your deployment. This feature allows you to efficiently deploy applications without needing to manually research runtime compatibility. The system determines the optimal runtime by analyzing the model type, model format, and selected hardware profile.

2.1.1. Hardware profile matching

The system suggests a runtime by matching the accelerator defined in your selected hardware profile with available runtimes. For example, if you select a hardware profile that uses an NVIDIA GPU accelerator, the system filters for compatible runtimes, such as vLLM NVIDIA GPU ServingRuntime for KServe.

Note

Automatic selection is available only if a hardware profile exists for the specific accelerator that you want to use.

2.1.2. Predictive model selection

For predictive models, you must select a Model format before the system can determine the appropriate serving runtime.

2.1.3. Selection limitations

The Auto-select option is displayed only when the system can identify a single, distinct match. If multiple serving runtime templates are defined for the same accelerator, the system cannot determine the best option automatically, and the auto-select option is not displayed for that hardware profile. In such cases, you must manually select a runtime.

2.1.4. Manual serving runtime selection

You can manually select a specific runtime from the Serving runtime list if the automatically selected option does not meet your needs. This option is useful when you require a specific version of a runtime or want to use a custom runtime that you have added to the platform. The Serving runtime list displays all global and project-scoped serving runtime templates available to you.

2.1.5. Administrator overrides

Cluster administrator settings can override standard hardware profile matching. If the Use distributed inference with llm-d by default when deploying generative models option is enabled in the administrator settings, the system defaults to the Distributed inference with llm-d runtime, regardless of other potential matches. This option is available in Settings > Cluster settings > General settings.
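
The selection rules in this section can be summarized as a small decision function. The following Python sketch is illustrative only and is not the dashboard's implementation: the `RuntimeTemplate` class, its fields, and the `auto_select_runtime` function are hypothetical names, and the way templates are matched to model formats is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

LLMD_RUNTIME = "Distributed inference with llm-d"


@dataclass(frozen=True)
class RuntimeTemplate:
    """Hypothetical stand-in for a serving runtime template."""
    name: str
    accelerator: str                        # accelerator the template targets
    model_formats: frozenset = frozenset()  # assumption: empty means "any format"


def auto_select_runtime(
    templates: Iterable[RuntimeTemplate],
    accelerator: str,
    model_type: str,                  # "generative" or "predictive"
    model_format: Optional[str] = None,
    llmd_default: bool = False,       # administrator override setting
) -> Optional[str]:
    """Return a runtime name, or None when Auto-select would not be offered."""
    # Administrator override: llm-d wins for generative models.
    if llmd_default and model_type == "generative":
        return LLMD_RUNTIME
    # Predictive models need a model format before a runtime can be chosen.
    if model_type == "predictive" and model_format is None:
        return None
    matches = [
        t for t in templates
        if t.accelerator == accelerator
        and (not t.model_formats or model_format is None
             or model_format in t.model_formats)
    ]
    # Auto-select is offered only for a single, distinct match.
    return matches[0].name if len(matches) == 1 else None
```

With one NVIDIA template the function returns that template's name. With two templates for the same accelerator it returns `None`, which corresponds to selecting a runtime manually.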

2.2. Deployment strategies for resource optimization

To optimize resource usage and manage downtime during model rollouts, you can configure the deployment strategy for your inference services. Choosing the appropriate strategy depends on your cluster’s available quotas, especially hardware accelerators such as GPUs, and your tolerance for service interruptions.

There are two primary deployment strategies available for model serving:

Rolling update

This strategy ensures zero downtime and continuous availability of the model. New inference service pods start while the existing pods are running. Traffic is switched to the new pods only after they are fully ready, and then the old pods are terminated.

However, rolling updates require increased resources like CPU, memory, and GPUs during the update process. Plan for approximately 200% of the pod requests as headroom during the transition because parallel instances exist briefly.

Recreate

This strategy prioritizes resource conservation over availability. All existing inference service pods are terminated before the new pods attempt to launch.

However, this method requires a period of downtime. The model endpoint is unavailable and returns errors between the termination of the old pod and the readiness of the new pod.
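
To make the resource trade-off concrete, the following Python sketch estimates the peak resource requests for each strategy. It assumes the worst case described above, in which a full set of old and new pods runs in parallel during a rolling update. The function names and strategy labels are hypothetical, and the result is a planning estimate, not a value reported by the platform.

```python
def peak_requests(per_pod_request: float, replicas: int, strategy: str) -> float:
    """Estimate the peak resource requests during a rollout.

    per_pod_request can be in any unit (GPUs, CPU cores, GiB of memory).
    """
    steady_state = per_pod_request * replicas
    if strategy == "rolling-update":
        # Old and new pods coexist briefly: plan for roughly 200%.
        return steady_state * 2
    if strategy == "recreate":
        # Old pods are terminated first: usage does not exceed 100%.
        return steady_state
    raise ValueError(f"unknown strategy: {strategy!r}")


def fits_quota(per_pod_request: float, replicas: int, strategy: str, quota: float) -> bool:
    """Check whether the estimated peak fits within a namespace quota."""
    return peak_requests(per_pod_request, replicas, strategy) <= quota
```

For example, two replicas that each request one GPU need about four GPUs of quota for a rolling update but only two for recreate.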

2.2.1. Choosing a deployment strategy

Choose the deployment strategy that best fits your availability requirements and resource quotas. The following table compares the rolling update and recreate strategies.

Strategy	Description	Resource impact	Recommended scenarios
Rolling update	Replaces pods gradually to ensure zero downtime. Traffic switches to new pods only after they are fully ready.	High: Requires approximately 200% of the request resources to host parallel instances during the transition.	Production workloads: Environments where the model must remain accessible without interruption. High-quota clusters: Namespaces with sufficient headroom to accommodate parallel instances.
Recreate	Terminates the old pod before starting the new one. Service is unavailable during the transition.	Low: Consumption does not exceed 100%. Prevents Insufficient Resources errors.	Resource-constrained environments: Projects using scarce hardware, such as high-end GPUs, where double allocation is not possible. Development and staging: Environments where downtime does not impact business operations. Batch processing: Workflows where immediate availability is not critical. Maintenance windows: Periods where service unavailability is expected.

Important

The Recreate strategy severs the connection to the old pod immediately. Ensure that your traffic routing gateway and client applications can handle a temporary gap in service before applying this strategy.

Note

The Recreate deployment strategy is available for all runtimes except Distributed inference with llm-d. If you select the Distributed inference with llm-d runtime, the deployment strategy options are not displayed and the system defaults to the Recreate strategy.

2.3. Deploying models on the model serving platform

You can deploy generative AI (gen AI) or predictive AI models on the model serving platform by using the Deploy a model wizard. The wizard allows you to configure your model, including specifying its location and type, selecting a serving runtime, assigning a hardware profile, and setting advanced configurations like external routes and token authentication.

To successfully deploy a model, you must meet the following prerequisites.

General prerequisites

You have logged in to Red Hat OpenShift AI.
You have installed KServe and enabled the model serving platform.
You have enabled a preinstalled or custom model-serving runtime.
You have created a project.
You have access to S3-compatible object storage, a URI-based repository, an OCI-compliant registry or a persistent volume claim (PVC) and have added a connection to your project. For more information about adding a connection, see Adding a connection to your project.
If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in OpenShift AI. If you use NVIDIA GPUs, see Enabling NVIDIA GPUs. If you use AMD GPUs, see AMD GPU integration.

Runtime-specific prerequisites

Meet the requirements for the specific runtime you intend to use.

Caikit-TGIS runtime
- To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see Converting Hugging Face Hub models to Caikit format in the caikit-tgis-serving repository.
vLLM NVIDIA GPU ServingRuntime for KServe
- To use the vLLM NVIDIA GPU ServingRuntime for KServe runtime, you have enabled GPU support in OpenShift AI and have installed and configured the Node Feature Discovery Operator on your cluster. For more information, see Installing the Node Feature Discovery Operator and Enabling NVIDIA GPUs.
vLLM CPU ServingRuntime for KServe
- To use the vLLM runtime on IBM Z and IBM Power, use the vLLM CPU ServingRuntime for KServe. You cannot use GPU accelerators with IBM Z and IBM Power architectures. For more information, see Red Hat OpenShift Multi Architecture Component Availability Matrix.
vLLM Intel Gaudi Accelerator ServingRuntime for KServe
- To use the vLLM Intel Gaudi Accelerator ServingRuntime for KServe runtime, you have enabled support for hybrid processing units (HPUs) in OpenShift AI. This includes installing the Intel Gaudi Base Operator and configuring a hardware profile. For more information, see Intel Gaudi Base Operator OpenShift installation in the Intel Gaudi documentation and Working with hardware profiles.
vLLM AMD GPU ServingRuntime for KServe
- To use the vLLM AMD GPU ServingRuntime for KServe runtime, you have enabled support for AMD graphics processing units (GPUs) in OpenShift AI. This includes installing the AMD GPU operator and configuring a hardware profile. For more information, see Deploying the AMD GPU operator on OpenShift and Working with hardware profiles.
vLLM Spyre AI Accelerator ServingRuntime for KServe

Important

Support for IBM Spyre AI Accelerators on x86 is currently available in Red Hat OpenShift AI 3.3 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

- To use the vLLM Spyre AI Accelerator ServingRuntime for KServe runtime on x86, you have installed the Spyre Operator and configured a hardware profile. For more information, see Spyre operator image and Working with hardware profiles.
vLLM Spyre s390x ServingRuntime for KServe
- To use the vLLM Spyre s390x ServingRuntime for KServe runtime on IBM Z, you have installed the Spyre Operator and configured a hardware profile. For more information, see Spyre operator image and Working with hardware profiles.

Procedure

In the left menu, click Projects.
Click the name of the project that you want to deploy a model in.
A project details page opens.
Click the Deployments tab.
Click Deploy model.
The Deploy a model wizard opens.
In the Model details section, provide information about the model:
1. From the Model location list, specify where your model is stored and complete the connection detail fields.
  Note
  The OCI-compliant registry, S3-compatible object storage, and URI options are preinstalled connection types. Additional options might be available if your OpenShift AI administrator added them.
  If you have uploaded model files to a persistent volume claim (PVC) and the PVC is attached to your workbench, the Cluster storage option becomes available in the Model location list. Use this option to select the PVC and specify the path to the model file.
2. From the Model type list, select the type of model that you are deploying: Predictive or Generative AI model.
3. Click Next.
In the Model deployment section, configure the deployment:
1. In the Model deployment name field, enter a unique name for your model deployment.
2. In the Description field, enter a description of your deployment.
3. From the Hardware profile list, select a hardware profile.
4. Optional: To modify the default resource allocation, click Customize resource requests and limits and enter new values for the CPU and Memory requests and limits.
5. In the Serving runtime field, select one of the following options:
  - Auto-select the best runtime for your model based on model type, model format, and hardware profile
    The system analyzes the selected model framework and your available hardware profiles to recommend a serving runtime.
  - Select from a list of serving runtimes, including custom ones
    Select this option to manually choose a runtime from the list of global and project-scoped serving runtime templates.
    For more information about how the system determines the best runtime and administrator overrides, see Automatic selection of serving runtimes.
6. Optional: If you selected a Predictive model type, select a framework from the Model framework (name - version) list. This field is hidden for Generative AI models.
7. In the Number of model server replicas to deploy field, specify a value.
8. Click Next.
In the Advanced settings section, configure advanced options:
1. Optional: (Generative AI models only) Select the Add as AI asset endpoint checkbox if you want to add your model’s endpoint to the Gen AI studio AI asset endpoints page.
  1. In the Use case field, enter the types of tasks that your model performs, such as chat, multimodal, or natural language processing.
    Note
    You must add your model as an AI asset endpoint to test your model on the Gen AI studio playground page.
2. Optional: Select the Model access checkbox to make your model deployment available through an external route.
3. Optional: To require token authentication for inference requests to the deployed model, select Require token authentication.
4. In the Service account name field, enter the service account name that the token will be generated for.
5. To add an additional service account, click Add a service account and enter another service account name.
6. Optional: Select Add custom runtime arguments or Add custom runtime environment variables to add configuration parameters to your deployment.
7. In the Deployment strategy section, select Rolling update or Recreate. For more information about deployment strategies, see Deployment strategies for resource optimization.
  Note
  The Recreate deployment strategy is available for all runtimes except Distributed inference with llm-d. If you select the Distributed inference with llm-d runtime, the deployment strategy options are not displayed and the system defaults to the Recreate strategy.
Click Deploy.

Verification

Confirm that the deployed model is shown on the Deployments tab for the project, and on the Deployments page of the dashboard with a checkmark in the Status column.

2.4. Deploying models by using the MLServer runtime

Deploy models with the MLServer ServingRuntime for KServe by specifying the model implementation and URI using environment variables in the Deploy a model wizard.

Important

MLServer ServingRuntime for KServe is currently available in Red Hat OpenShift AI as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Prerequisites

You have logged in to Red Hat OpenShift AI.
You have installed KServe and enabled the model serving platform.
The MLServer ServingRuntime for KServe is enabled in your cluster.
You have created a project.
Your model is stored in a location accessible to the model server and you have added a connection to your project:
- S3-compatible object storage
- Persistent Volume Claim
You are deploying a model that uses one of the supported MLServer implementations:
- Scikit-learn
- XGBoost
- LightGBM

Note

The model name is automatically exported from the model deployment name. You do not need to set an MLSERVER_MODEL_NAME environment variable. If you manually configure MLSERVER_MODEL_NAME, you must set the value to match your model deployment name.

Important

You can also use MLServer’s model-settings.json file for model configuration. If a model-settings.json file is present alongside your model file, the MLServer runtime loads configuration values from that file and overrides any environment variables you set through the deployment wizard.
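
As an illustration, the following Python sketch generates a minimal model-settings.json file. The field layout (`name`, `implementation`, and `parameters.uri`) follows the MLServer model settings convention as commonly documented; verify it against the MLServer version shipped with your runtime. The model name `sample-model` and the helper function are hypothetical.

```python
import json


def model_settings(name: str, implementation: str, uri: str) -> dict:
    """Build the content of a model-settings.json file (assumed layout)."""
    return {
        "name": name,                      # should match the model deployment name
        "implementation": implementation,  # same value as MLSERVER_MODEL_IMPLEMENTATION
        "parameters": {"uri": uri},        # same value as MLSERVER_MODEL_URI
    }


settings = model_settings(
    name="sample-model",  # hypothetical deployment name
    implementation="mlserver_xgboost.XGBoostModel",
    uri="/mnt/models/model.json",
)

# Write the file next to the model artifact before uploading both to storage.
with open("model-settings.json", "w", encoding="utf-8") as fh:
    json.dump(settings, fh, indent=2)
```

Because values in this file take precedence, keep it consistent with any environment variables that you set in the wizard.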

Procedure

Deploy the model using the Deploy a model wizard.
For complete deployment instructions, see Deploying models on the model serving platform.

In the Advanced settings section of the wizard, configure the environment variables:

Under Configuration parameters, select the Add custom runtime environment variables checkbox.
Click Add variable.

Add the appropriate variables for your model framework as shown in the following examples:

Note

For MLSERVER_MODEL_URI, you can specify either:

Absolute path: An absolute path to a specific model file such as /mnt/models/model.json
Directory path: A directory path such as /mnt/models. If you use a directory path, your model file must use one of the following well-known filenames:
- XGBoost: model.bst, model.json, model.ubj
- LightGBM: model.bst
- Scikit-learn: model.joblib, model.pickle, model.pkl

Table 2.1. For an XGBoost model
Key	Value
MLSERVER_MODEL_IMPLEMENTATION	`mlserver_xgboost.XGBoostModel`
MLSERVER_MODEL_URI	`/mnt/models/model.json`

Table 2.2. For a Scikit-learn model
Key	Value
MLSERVER_MODEL_IMPLEMENTATION	`mlserver_sklearn.SKLearnModel`
MLSERVER_MODEL_URI	`/mnt/models/model.joblib`

Table 2.3. For a LightGBM model
Key	Value
MLSERVER_MODEL_IMPLEMENTATION	`mlserver_lightgbm.LightGBMModel`
MLSERVER_MODEL_URI	`/mnt/models/model.bst`
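
The three tables and the well-known filename rule can be captured in a short lookup. The following Python sketch is a convenience for checking values before you enter them in the wizard. The framework keys (`xgboost`, `sklearn`, `lightgbm`) and function names are hypothetical; the implementation classes and filenames are the ones listed above.

```python
IMPLEMENTATIONS = {
    "xgboost": "mlserver_xgboost.XGBoostModel",
    "sklearn": "mlserver_sklearn.SKLearnModel",
    "lightgbm": "mlserver_lightgbm.LightGBMModel",
}

# Filenames that a directory-style MLSERVER_MODEL_URI can resolve.
WELL_KNOWN_FILENAMES = {
    "xgboost": ("model.bst", "model.json", "model.ubj"),
    "lightgbm": ("model.bst",),
    "sklearn": ("model.joblib", "model.pickle", "model.pkl"),
}


def runtime_env(framework: str, model_uri: str) -> dict:
    """Return the environment variables to add in the wizard."""
    return {
        "MLSERVER_MODEL_IMPLEMENTATION": IMPLEMENTATIONS[framework],
        "MLSERVER_MODEL_URI": model_uri,
    }


def directory_uri_is_usable(framework: str, filenames: list) -> bool:
    """True if a directory contains at least one well-known model filename."""
    return any(name in WELL_KNOWN_FILENAMES[framework] for name in filenames)
```

For example, a directory that contains only `classifier.pkl` fails the check for `sklearn`, so you would either rename the file to `model.pkl` or set `MLSERVER_MODEL_URI` to the absolute file path.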

Verification

Confirm that the deployed model is shown on the Deployments tab for the project with a checkmark in the Status column.
Test the model by querying the ready endpoint:
```
$ curl -H "Content-Type: application/json" \
https://<inference_endpoint_url>/v2/models/<model_name>/ready
```
where:
<inference_endpoint_url>
Specifies the inference endpoint URL displayed in the model details.
<model_name>
Specifies the name of your deployed model.
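
If you prefer to script this check, the following Python sketch performs the same readiness query with the standard library. The function names are hypothetical, and the sketch assumes that the endpoint is reachable without a token and presents a certificate that your system trusts.

```python
import urllib.request
from urllib.parse import quote


def ready_url(endpoint: str, model_name: str) -> str:
    """Build the readiness URL for a deployed model."""
    return f"{endpoint.rstrip('/')}/v2/models/{quote(model_name, safe='')}/ready"


def is_ready(endpoint: str, model_name: str, timeout: float = 5.0) -> bool:
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(ready_url(endpoint, model_name), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers HTTP errors, connection failures, and timeouts.
        return False
```

Call `is_ready("https://<inference_endpoint_url>", "<model_name>")` with the same values that you would pass to `curl`.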

2.5. Deploying a model stored in an OCI image by using the CLI

You can deploy a model that is stored in an OCI image from the command line interface.

The following procedure uses the example of deploying a MobileNet v2-7 model in ONNX format, stored in an OCI image on an OpenVINO model server.

Note

By default in KServe, models are exposed outside the cluster and not protected with authentication.

Prerequisites

You have stored a model in an OCI image as described in Storing a model in an OCI image.
If you want to deploy a model that is stored in a private OCI repository, you must configure an image pull secret. For more information about creating an image pull secret, see Using image pull secrets.
You are logged in to your OpenShift cluster.

Procedure

Create a project to deploy the model:
```
oc new-project oci-model-example
```
Use the OpenShift AI Applications project kserve-ovms template to create a ServingRuntime resource and configure the OpenVINO model server in the new project:
```
oc process -n redhat-ods-applications -o yaml kserve-ovms | oc apply -f -
```

Verify that the ServingRuntime named kserve-ovms is created:

```
oc get servingruntimes
```

The command should return output similar to the following:

```
NAME          DISABLED   MODELTYPE     CONTAINERS         AGE
kserve-ovms              openvino_ir   kserve-container   1m
```

Create an InferenceService YAML resource, depending on whether the model is stored in a private or a public OCI repository:

For a model stored in a public OCI repository, create an InferenceService YAML file with the following values, replacing <user_name>, <repository_name>, and <tag_name> with values specific to your environment:

```
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sample-isvc-using-oci
spec:
  predictor:
    model:
      runtime: kserve-ovms # Ensure this matches the name of the ServingRuntime resource
      modelFormat:
        name: onnx
      storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>
      resources:
        requests:
          memory: 500Mi
          cpu: 100m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
        limits:
          memory: 4Gi
          cpu: 500m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
```

For a model stored in a private OCI repository, create an InferenceService YAML file that specifies your pull secret in the spec.predictor.imagePullSecrets field, as shown in the following example:

```
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sample-isvc-using-private-oci
spec:
  predictor:
    model:
      runtime: kserve-ovms # Ensure this matches the name of the ServingRuntime resource
      modelFormat:
        name: onnx
      storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>
      resources:
        requests:
          memory: 500Mi
          cpu: 100m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
        limits:
          memory: 4Gi
          cpu: 500m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
    imagePullSecrets: # Specify image pull secrets to use for fetching container images, including OCI model images
    - name: <pull-secret-name>
```

After you create the InferenceService resource, KServe deploys the model stored in the OCI image referred to by the storageUri field.
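
If you generate manifests programmatically, the following Python sketch builds the same InferenceService structure as a dictionary and prints it as JSON, which `oc apply -f -` accepts in place of YAML. The `build_inference_service` helper is hypothetical; the field names and resource values mirror the examples above.

```python
import json
from typing import Optional


def build_inference_service(
    name: str,
    storage_uri: str,
    runtime: str = "kserve-ovms",
    model_format: str = "onnx",
    pull_secret: Optional[str] = None,
) -> dict:
    """Build an InferenceService manifest for a model stored in an OCI image."""
    if not storage_uri.startswith("oci://"):
        raise ValueError("storage_uri must use the oci:// scheme")
    predictor = {
        "model": {
            "runtime": runtime,  # must match the ServingRuntime resource name
            "modelFormat": {"name": model_format},
            "storageUri": storage_uri,
            "resources": {
                "requests": {"memory": "500Mi", "cpu": "100m"},
                "limits": {"memory": "4Gi", "cpu": "500m"},
            },
        }
    }
    if pull_secret:
        # Only needed for private OCI repositories.
        predictor["imagePullSecrets"] = [{"name": pull_secret}]
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {"predictor": predictor},
    }


if __name__ == "__main__":
    manifest = build_inference_service(
        "sample-isvc-using-oci",
        "oci://quay.io/<user_name>/<repository_name>:<tag_name>",
    )
    print(json.dumps(manifest, indent=2))
```

Replace the placeholders in the storage URI before you pipe the output to `oc apply -f -`.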

Verification

Check the status of the deployment:

```
oc get inferenceservice
```

The command should return output that includes information, such as the URL of the deployed model and its readiness state.

2.6. Deploying models by using Distributed Inference with llm-d

Distributed Inference with llm-d is a Kubernetes-native, open-source framework designed for serving large language models (LLMs) at scale. You can use Distributed Inference with llm-d to simplify the deployment of generative AI, focusing on high performance and cost-effectiveness across various hardware accelerators.

Key features of Distributed Inference with llm-d include:

Efficiently handles large models using optimizations such as prefix-cache aware routing and disaggregated serving.
Integrates into a standard Kubernetes environment, where it leverages specialized components like the Envoy proxy to handle networking and routing, and high-performance libraries such as vLLM and NVIDIA Inference Transfer Library (NIXL).
Tested recipes and well-known presets reduce the complexity of deploying inference at scale, so users can focus on building applications rather than managing infrastructure.

2.6.1. Enabling Distributed Inference with llm-d

This procedure describes how to create a custom resource (CR) for an LLMInferenceService resource. You replace the default InferenceService with the LLMInferenceService.

Prerequisites

You have enabled the model serving platform.
You have access to an OpenShift cluster running version 4.19.9 or later.
OpenShift Service Mesh v2 is not installed in the cluster.
Your cluster administrator has created a GatewayClass and a Gateway named openshift-ai-inference in the openshift-ingress namespace as described in Gateway API with OpenShift Container Platform Networking.
Important
Review the Gateway API deployment topologies. Only use shared Gateways across trusted namespaces.
Your cluster administrator has installed the LeaderWorkerSet Operator in OpenShift. For more information, see the Leader Worker Set Operator documentation.
If you are running OpenShift on a bare-metal cluster, your cluster administrator has an external entry point for the openshift-ai-inference Gateway service.
Note
By default, the Inference Gateway uses type: LoadBalancer. If the cluster does not already include support for LoadBalancer services, you can use the OpenShift option described in Load balancing with MetalLB.
You have enabled authentication as described in Configuring authentication for Distributed Inference with llm-d.

Procedure

Log in to the OpenShift console as a developer.
Create the LLMInferenceService CR with the following information:
```
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: sample-llm-inference-service
spec:
  replicas: 2
  model:
    uri: hf://RedHatAI/Qwen3-8B-FP8-dynamic
    name: RedHatAI/Qwen3-8B-FP8-dynamic
  router:
    route: {}
    gateway: {}
    scheduler: {}
    template:
      containers:
      - name: main
        resources:
          limits:
            cpu: '4'
            memory: 32Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: '2'
            memory: 16Gi
            nvidia.com/gpu: "1"
```
Customize the following parameters in the spec section of the inference service:
- replicas - Specify the number of replicas.
- model - Specify the URI to the model based on how the model is stored (uri) and the model name to use in chat completion requests (name).
  - S3 bucket: s3://<bucket-name>/<object-key>
  - Persistent volume claim (PVC): pvc://<claim-name>/<pvc-path>
  - OCI container image: oci://<registry_host>/<org_or_username>/<repository_name><tag_or_digest>
  - Hugging Face: hf://<model>/<optional-hash>
- router - Provide an HTTPRoute and gateway, or leave blank to automatically create one.
Save the file.
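
Before you apply the CR, it can help to check that the model URI uses one of the schemes listed above. The following Python sketch classifies a URI by its scheme. The `storage_kind` function is a hypothetical helper and performs no validation beyond the scheme.

```python
from urllib.parse import urlsplit

# Schemes listed for spec.model.uri in this procedure.
STORAGE_KINDS = {
    "s3": "S3 bucket",
    "pvc": "Persistent volume claim (PVC)",
    "oci": "OCI container image",
    "hf": "Hugging Face",
}


def storage_kind(uri: str) -> str:
    """Return the storage type for a model URI, or raise ValueError."""
    scheme = urlsplit(uri).scheme.lower()
    if scheme not in STORAGE_KINDS:
        raise ValueError(f"unsupported model URI scheme: {scheme or '(none)'}")
    return STORAGE_KINDS[scheme]
```

For example, `storage_kind("hf://RedHatAI/Qwen3-8B-FP8-dynamic")` identifies the URI from the sample CR as a Hugging Face model.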

2.6.2. Configuring authentication for Distributed Inference with llm-d using Red Hat Connectivity Link

Red Hat Connectivity Link provides Kubernetes-native authentication and authorization capabilities for Distributed Inference with llm-d inference endpoints. Connectivity Link works with the gateway to intercept incoming traffic before it reaches the vLLM inference service, validating the requests based on authentication tokens and authorization policies. For more information about Connectivity Link concepts and capabilities, see Introduction to Connectivity Link.

Prerequisites

You have installed Red Hat Connectivity Link version 1.1.1 or later. For more information, see Installing Connectivity Link on OpenShift.
You have access to the OpenShift CLI (oc).
The ServiceAccount has permission to get the corresponding LLMInferenceService and you have generated a JSON web token (JWT).

Procedure

Create the Kuadrant custom resource (CR) to set up required objects:

```
oc apply -f - <<EOF
apiVersion: kuadrant.io/v1beta1
kind: Kuadrant
metadata:
  name: kuadrant
  namespace: kuadrant-system
EOF
```

Wait for Kuadrant to become ready:

```
oc wait Kuadrant -n kuadrant-system kuadrant --for=condition=Ready --timeout=10m
```

Add the ServingCert annotation to the Authorino Service:

```
oc annotate svc/authorino-authorino-authorization service.beta.openshift.io/serving-cert-secret-name=authorino-server-cert -n kuadrant-system
```

Wait for the secret to be created:
```
sleep 2
```

Update Authorino to enable SSL:

```
oc apply -f - <<EOF
apiVersion: operator.authorino.kuadrant.io/v1beta1
kind: Authorino
metadata:
  name: authorino
  namespace: kuadrant-system
spec:
  replicas: 1
  clusterWide: true
  listener:
    tls:
      enabled: true
      certSecretRef:
        name: authorino-server-cert
  oidcServer:
    tls:
      enabled: false
EOF
```

Verify that the Authorino pods are ready:

```
oc wait --for=condition=ready pod -l authorino-resource=authorino -n kuadrant-system --timeout 150s
```

If OpenShift AI was installed before installing Connectivity Link and Kuadrant, restart the controllers:

```
oc delete pod -n redhat-ods-applications -l app=odh-model-controller
oc delete pod -n redhat-ods-applications -l control-plane=kserve-controller-manager
```

2.6.3. Enabling authentication and authorization for an LLM inference service

In OpenShift AI 3.0 and later, authentication and authorization are automatically enabled for LLMInferenceService resources when Red Hat Connectivity Link is configured. You can use the security.opendatahub.io/enable-auth: "true" annotation to explicitly enable authentication, such as re-enabling it after it was previously disabled.

Prerequisites

You have configured Red Hat Connectivity Link for Distributed Inference with llm-d as described in Configuring authentication for Distributed Inference with llm-d using Red Hat Connectivity Link.
You have created an LLMInferenceService resource as described in Enabling Distributed Inference with llm-d.
You have access to the OpenShift CLI (oc).

Procedure

By default, authentication is enabled automatically. To explicitly enable authentication or to re-enable it after disabling, annotate your LLMInferenceService resource:

```
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: sample-llm-inference-service
  annotations:
    security.opendatahub.io/enable-auth: "true"
spec:
  replicas: 2
  model:
    uri: hf://RedHatAI/Qwen3-8B-FP8-dynamic
    name: RedHatAI/Qwen3-8B-FP8-dynamic
  router:
    route: {}
    gateway: {}
    scheduler: {}
    template:
      containers:
      - name: main
        resources:
          limits:
            cpu: '4'
            memory: 32Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: '2'
            memory: 16Gi
            nvidia.com/gpu: "1"
```

Apply the configuration:

```
oc apply -f <llm-inference-service-file>.yaml
```

Verification

Confirm that the LLMInferenceService resource has the annotation:

```
oc get llminferenceservice sample-llm-inference-service -o jsonpath='{.metadata.annotations.security\.opendatahub\.io/enable-auth}'
```

The command returns true.

Verify that the inference service is protected by attempting to access it without authentication:
```
curl -v https://<inference-endpoint-url>/v1/models
```
The request returns a 401 Unauthorized response, confirming that unauthenticated requests are rejected.
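
For comparison, an authenticated client attaches the token to the same request. The following Python sketch builds the request with the standard library. It assumes that the token is passed as a bearer token in the `Authorization` header, which is the common convention but is not stated in this procedure. The `models_request` helper is hypothetical.

```python
import urllib.request
from typing import Optional


def models_request(endpoint: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for the /v1/models path of an inference endpoint."""
    request = urllib.request.Request(f"{endpoint.rstrip('/')}/v1/models")
    if token:
        # Assumption: the gateway expects a bearer token (JWT).
        request.add_header("Authorization", f"Bearer {token}")
    return request
```

Pass the result to `urllib.request.urlopen()`. Without a token, the call raises `urllib.error.HTTPError` with code 401 when authentication is enforced.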

2.6.4. Example usage for Distributed Inference with llm-d

These examples show how to use Distributed Inference with llm-d in common scenarios.

2.6.4.1. Single-node GPU deployment

Use single-GPU-per-replica deployment patterns for development, testing, or production deployments of smaller models, such as 7-billion-parameter models.

For examples using single-node GPU deployments, see Single-Node GPU Deployment Examples.

2.6.4.2. Multi-node deployment

For examples using multi-node deployments, see DeepSeek-R1 Multi-Node Deployment Examples.

2.6.4.3. Intelligent inference scheduler with KV cache routing

You can configure the scheduler to track key-value (KV) cache blocks across inference endpoints and route requests to the endpoint with the highest cache hit rate. This configuration improves throughput and reduces latency by maximizing cache reuse.

For an example, see Precise Prefix KV Cache Routing.
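
The routing idea can be illustrated with a toy model. The following Python sketch is not the llm-d scheduler: it assumes that each request is split into an ordered list of block identifiers and that each endpoint reports the set of blocks it has cached, and it picks the endpoint that can reuse the longest prefix. All names and data structures are hypothetical.

```python
from typing import Dict, Sequence, Set


def cached_prefix_len(request_blocks: Sequence[str], cached: Set[str]) -> int:
    """Count how many leading blocks of the request are already cached."""
    count = 0
    for block in request_blocks:
        if block not in cached:
            break  # reuse stops at the first missing block
        count += 1
    return count


def pick_endpoint(request_blocks: Sequence[str], cache_index: Dict[str, Set[str]]) -> str:
    """Choose the endpoint with the longest cached prefix for this request."""
    if not cache_index:
        raise ValueError("no endpoints available")
    # On a tie, max() keeps the first endpoint in insertion order.
    return max(
        cache_index,
        key=lambda endpoint: cached_prefix_len(request_blocks, cache_index[endpoint]),
    )
```

A production scheduler also has to weigh factors such as load and queue depth, which this sketch ignores.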

2.7. Monitoring models

You can monitor models that are deployed on the model serving platform to view performance and resource usage metrics.

2.7.1. Viewing performance metrics for a deployed model

You can monitor the following metrics for a specific model that is deployed on the model serving platform:

Number of requests - The number of requests that have failed or succeeded for a specific model.
Average response time (ms) - The average time it takes a specific model to respond to requests.
CPU utilization (%) - The percentage of the CPU limit per model replica that is currently utilized by a specific model.
Memory utilization (%) - The percentage of the memory limit per model replica that is utilized by a specific model.

You can specify a time range and a refresh interval for these metrics to help you determine, for example, when the peak usage hours are and how the model is performing at a specified time.

Prerequisites

You have installed Red Hat OpenShift AI.
A cluster admin has enabled user workload monitoring (UWM) for user-defined projects on your OpenShift cluster. For more information, see Enabling monitoring for user-defined projects and Configuring monitoring for the model serving platform.
You have logged in to Red Hat OpenShift AI.
The following dashboard configuration options are set to the default values as shown:
```
disablePerformanceMetrics: false
disableKServeMetrics: false
```
For more information about setting dashboard configuration options, see Customizing the dashboard.
You have deployed a model on the model serving platform by using a preinstalled runtime.
Note
Metrics are supported only for models that are deployed by using a preinstalled model-serving runtime or a custom runtime that is duplicated from a preinstalled runtime.

Procedure

From the OpenShift AI dashboard navigation menu, click Projects.
The Projects page opens.
Click the name of the project that contains the data science models that you want to monitor.
In the project details page, click the Deployments tab.
Select the model that you are interested in.
On the Endpoint performance tab, set the following options:
- Time range - Specifies the period over which metrics are displayed. You can select one of these values: 1 hour, 24 hours, 7 days, or 30 days.
- Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed to show the latest data. You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, or 1 day.
Scroll down to view data graphs for number of requests, average response time, CPU utilization, and memory utilization.

Verification

The Endpoint performance tab shows graphs of metrics for the model.

2.7.2. Viewing model-serving runtime metrics for the model serving platform

When a cluster administrator has configured monitoring for the model serving platform, non-admin users can use the OpenShift web console to view model-serving runtime metrics for the KServe component.

Prerequisites

A cluster administrator has configured monitoring for the model serving platform.
You have been assigned the monitoring-rules-view role. For more information, see Granting users permission to configure monitoring for user-defined projects.
You are familiar with how to monitor project metrics in the OpenShift web console. For more information, see Monitoring your project metrics.

Procedure

Log in to the OpenShift web console.
Switch to the Developer perspective.
In the left menu, click Observe.
As described in Monitoring your project metrics, use the web console to run queries for model-serving runtime metrics. You can also run queries for metrics that are related to OpenShift Service Mesh. The following queries are examples:
1. The following query displays the number of successful inference requests over a period of time for a model deployed with the vLLM runtime:
  sum(increase(vllm:request_success_total{namespace=${namespace},model_name=${model_name}}[${rate_interval}]))
  Note
  Certain vLLM metrics are available only after an inference request is processed by a deployed model. To generate and view these metrics, you must first make an inference request to the model.
2. The following query displays the number of successful inference requests over a period of time for a model deployed with the OpenVINO Model Server runtime:
  sum(increase(ovms_requests_success{namespace=${namespace},name=${model_name}}[${rate_interval}]))
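
The `${namespace}`, `${model_name}`, and `${rate_interval}` tokens in the queries above are placeholders. A minimal sketch of filling them in is shown here, assuming a hypothetical project named `my-project`, a model named `my-model`, and a 5-minute interval; the helper function `success_query` is also hypothetical. Note that PromQL label matchers take quoted string values, so the sketch wraps the substituted values in double quotation marks.

```python
def success_query(metric: str, model_label: str, namespace: str,
                  model_name: str, rate_interval: str) -> str:
    """Build a PromQL query that sums the increase of a success counter
    for one model in one namespace over the given interval."""
    selector = f'{metric}{{namespace="{namespace}",{model_label}="{model_name}"}}'
    return f"sum(increase({selector}[{rate_interval}]))"


# Placeholder values below are assumptions for illustration only.
vllm_query = success_query(
    "vllm:request_success_total", "model_name", "my-project", "my-model", "5m"
)
ovms_query = success_query(
    "ovms_requests_success", "name", "my-project", "my-model", "5m"
)

print(vllm_query)
print(ovms_query)
```

The two runtimes use different label names for the model (`model_name` for vLLM and `name` for OpenVINO Model Server), which is why the label is passed as a parameter.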
