Chapter 2. Deploying models on the single-model serving platform
The single-model serving platform deploys each model on its own dedicated model server. This architecture is ideal for deploying, monitoring, scaling, and maintaining large models that require more resources, such as large language models (LLMs).
The platform is based on the KServe component and offers two deployment modes:
- KServe RawDeployment: Uses a standard deployment method that does not require serverless dependencies.
- Knative Serverless: Uses Red Hat OpenShift Serverless for deployments that can automatically scale based on demand.
2.1. About KServe deployment modes
KServe offers two deployment modes for serving models. The default mode, Knative Serverless, is based on the open-source Knative project and provides powerful autoscaling capabilities. It integrates with Red Hat OpenShift Serverless and Red Hat OpenShift Service Mesh. Alternatively, the KServe RawDeployment mode offers a more traditional deployment method with fewer dependencies.
Before you choose an option, understand how your initial configuration affects future deployments:
- If you configure for Knative Serverless: You can use both Knative Serverless and KServe RawDeployment modes.
- If you configure for KServe RawDeployment only: You can only use the KServe RawDeployment mode.
Use the following comparison to choose the option that best fits your requirements.
| Criterion | Knative Serverless | KServe RawDeployment |
|---|---|---|
| Default mode | Yes | No |
| Recommended use case | Most workloads. | Custom serving setups or models that must remain active. |
| Autoscaling | Scales automatically based on request demand, including scaling to zero when there is no traffic. | Does not scale to zero; you manage scaling with standard Kubernetes mechanisms, such as the Horizontal Pod Autoscaler. |
| Dependencies | Requires Red Hat OpenShift Serverless and Red Hat OpenShift Service Mesh. | None; uses standard Kubernetes resources such as Deployment, Service, and Horizontal Pod Autoscaler. |
| Configuration flexibility | Has some customization limitations inherited from Knative compared to raw Kubernetes deployments. | Provides full control over pod specifications because it uses standard Kubernetes Deployment resources. |
| Resource footprint | Larger, due to the additional dependencies required for serverless functionality. | Smaller. |
| Setup complexity | Might require additional configuration in setup and management. If Serverless is not already installed on the cluster, you must install and configure it. | Requires a simpler setup with fewer dependencies. |
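When you deploy a model by creating an InferenceService resource directly, rather than through the dashboard, the deployment mode is selected with the serving.kserve.io/deploymentMode annotation. The following minimal example illustrates the annotation only; the runtime name, model format, and storage location are placeholders that you must replace with values from your environment:

  apiVersion: serving.kserve.io/v1beta1
  kind: InferenceService
  metadata:
    name: example-model
    annotations:
      serving.kserve.io/deploymentMode: RawDeployment    # omit or set to Serverless to use Knative Serverless
  spec:
    predictor:
      model:
        modelFormat:
          name: onnx                                     # placeholder model format
        runtime: kserve-ovms                             # placeholder runtime name
        storageUri: s3://<bucket_name>/<model_path>      # placeholder model location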
2.2. Deploying models on the single-model serving platform
When you have enabled the single-model serving platform, you can enable a preinstalled or custom model-serving runtime and deploy models on the platform.
You can use preinstalled model-serving runtimes to start serving models without modifying or defining the runtime yourself. For help adding a custom runtime, see Adding a custom model-serving runtime for the single-model serving platform.
To successfully deploy a model, you must meet the following prerequisites.
General prerequisites
- You have logged in to Red Hat OpenShift AI.
- You have installed KServe and enabled the single-model serving platform.
- (Knative Serverless deployments only) To enable token authentication and external model routes for deployed models, you have added Authorino as an authorization provider. For more information, see Adding an authorization provider for the single-model serving platform.
- You have created a data science project.
- You have access to S3-compatible object storage, a URI-based repository, an OCI-compliant registry, or a persistent volume claim (PVC), and you have added a connection to your data science project. For more information about adding a connection, see Adding a connection to your data science project.
- If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in OpenShift AI. If you use NVIDIA GPUs, see Enabling NVIDIA GPUs. If you use AMD GPUs, see AMD GPU integration.
Runtime-specific prerequisites
Meet the requirements for the specific runtime you intend to use.
Caikit-TGIS runtime
- To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see Converting Hugging Face Hub models to Caikit format in the caikit-tgis-serving repository.
vLLM NVIDIA GPU ServingRuntime for KServe
- To use the vLLM NVIDIA GPU ServingRuntime for KServe runtime, you have enabled GPU support in OpenShift AI and have installed and configured the Node Feature Discovery Operator on your cluster. For more information, see Installing the Node Feature Discovery Operator and Enabling NVIDIA GPUs.
vLLM CPU ServingRuntime for KServe
- To serve models on IBM Z and IBM Power, use the vLLM CPU ServingRuntime for KServe. You cannot use GPU accelerators with IBM Z and IBM Power architectures. For more information, see Red Hat OpenShift Multi Architecture Component Availability Matrix.
vLLM Intel Gaudi Accelerator ServingRuntime for KServe
- To use the vLLM Intel Gaudi Accelerator ServingRuntime for KServe runtime, you have enabled support for hybrid processing units (HPUs) in OpenShift AI. This includes installing the Intel Gaudi Base Operator and configuring a hardware profile. For more information, see Intel Gaudi Base Operator OpenShift installation in the Intel Gaudi documentation and Working with hardware profiles.
vLLM AMD GPU ServingRuntime for KServe
- To use the vLLM AMD GPU ServingRuntime for KServe runtime, you have enabled support for AMD graphic processing units (GPUs) in OpenShift AI. This includes installing the AMD GPU operator and configuring a hardware profile. For more information, see Deploying the AMD GPU operator on OpenShift and Working with hardware profiles.
vLLM Spyre AI Accelerator ServingRuntime for KServe
Important: Support for IBM Spyre AI Accelerators on x86 is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
- To use the vLLM Spyre AI Accelerator ServingRuntime for KServe runtime on x86, you have installed the Spyre Operator and configured a hardware profile. For more information, see Spyre operator image and Working with hardware profiles.
Procedure
- In the left menu, click Data science projects.
- Click the name of the project that you want to deploy a model in.
  A project details page opens.
- Click the Models tab.
- Click Select single-model to deploy your model using single-model serving.
- Click the Deploy model button.
  The Deploy model dialog opens.
- In the Model deployment name field, enter a unique name for the model that you are deploying.
- In the Serving runtime field, select an enabled runtime. If project-scoped runtimes exist, the Serving runtime list includes subheadings to distinguish between global runtimes and project-scoped runtimes.
- From the Model framework (name - version) list, select a value if applicable.
- From the Deployment mode list, select KServe RawDeployment or Knative Serverless. For more information about deployment modes, see About KServe deployment modes.
- In the Number of model server replicas to deploy field, specify a value.
The following options are only available if you have created a hardware profile:
- From the Hardware profile list, select a hardware profile. If project-scoped hardware profiles exist, the Hardware profile list includes subheadings to distinguish between global hardware profiles and project-scoped hardware profiles.
  Important: By default, hardware profiles are hidden in the dashboard navigation menu and user interface, while accelerator profiles remain visible. In addition, user interface components associated with the deprecated accelerator profiles functionality are still displayed. If you enable hardware profiles, the Hardware profiles list is displayed instead of the Accelerator profiles list. To show the Settings → Hardware profiles option in the dashboard navigation menu, and the user interface components associated with hardware profiles, set the disableHardwareProfiles value to false in the OdhDashboardConfig custom resource (CR) in OpenShift. For more information about setting dashboard configuration options, see Customizing the dashboard.
- Optional: The hardware profile specifies the number of CPUs and the amount of memory allocated to the container, setting the guaranteed minimum (request) and maximum (limit) for both. To change these default values, click Customize resource requests and limit and enter new minimum (request) and maximum (limit) values.
- Optional: In the Model route section, select the Make deployed models available through an external route checkbox to make your deployed models available to external clients.
- Optional: To require token authentication for inference requests to the deployed model, perform the following actions. An example authenticated inference request is shown after the verification steps.
- Select Require token authentication.
- In the Service account name field, enter the service account name that the token will be generated for.
- To add an additional service account, click Add a service account and enter another service account name.
- To specify the location of your model, select a Connection type that you have added. The OCI-compliant registry, S3-compatible object storage, and URI options are preinstalled connection types. Additional options might be available if your OpenShift AI administrator added them.
For S3-compatible object storage: In the Path field, enter the folder path that contains the model in your specified data source.
Important: The OpenVINO Model Server runtime has specific requirements for how you specify the model path. For more information, see known issue RHOAIENG-3025 in the OpenShift AI release notes.
For Open Container Image connections: In the OCI storage location field, enter the model URI where the model is located.
Note: If you are deploying a registered model version with an existing S3, URI, or OCI data connection, some of your connection details might be autofilled. This depends on the type of data connection and the number of matching connections available in your data science project. For example, if only one matching connection exists, fields like the path, URI, endpoint, model URI, bucket, and region might populate automatically. Matching connections are labeled as Recommended.
- Complete the connection detail fields.
- Optional: If you have uploaded model files to a persistent volume claim (PVC) and the PVC is attached to your workbench, use the Existing cluster storage option to select the PVC and specify the path to the model file.
  Important: If your connection type is S3-compatible object storage, you must provide the folder path that contains your data file. The OpenVINO Model Server runtime has specific requirements for how you specify the model path. For more information, see known issue RHOAIENG-3025 in the OpenShift AI release notes.
- Optional: Customize the runtime parameters in the Configuration parameters section. The section shows predefined serving runtime parameters, if any are available.
  - Modify the values in Additional serving runtime arguments to define how the deployed model behaves.
  - Modify the values in Additional environment variables to define variables in the model’s environment.
  Note: Do not modify the port or model serving runtime arguments, because they require specific values to be set. Overwriting these parameters can cause the deployment to fail.
- Click Deploy.
Verification
- Confirm that the deployed model is shown on the Models tab for the project, and on the Model deployments page of the dashboard with a checkmark in the Status column.
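If you enabled token authentication and an external route, you can send an authenticated inference request to the deployed model to confirm that it responds. The following example is illustrative only: the host name comes from your model route, the request path and payload follow the KServe v2 REST protocol used by runtimes such as OpenVINO Model Server (other runtimes expose different paths), and $TOKEN contains the token generated for your service account:

  curl -ks https://<model_route_host>/v2/models/<model_name>/infer \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"inputs": [{"name": "<input_name>", "shape": [1, 4], "datatype": "FP32", "data": [0.1, 0.2, 0.3, 0.4]}]}'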
2.3. Deploying a model stored in an OCI image by using the CLI
You can deploy a model that is stored in an OCI image from the command line interface.
The following procedure uses the example of deploying a MobileNet v2-7 model in ONNX format, stored in an OCI image on an OpenVINO model server.
By default in KServe, models are exposed outside the cluster and not protected with authentication.
Prerequisites
- You have stored a model in an OCI image as described in Storing a model in an OCI image.
- If you want to deploy a model that is stored in a private OCI repository, you must configure an image pull secret. For more information about creating an image pull secret, see Using image pull secrets.
- You are logged in to your OpenShift cluster.
Procedure
- Create a project to deploy the model:

  oc new-project oci-model-example

- Use the kserve-ovms template from the OpenShift AI Applications project (redhat-ods-applications) to create a ServingRuntime resource and configure the OpenVINO model server in the new project:

  oc process -n redhat-ods-applications -o yaml kserve-ovms | oc apply -f -

- Verify that the ServingRuntime named kserve-ovms is created:

  oc get servingruntimes

  The command should return output similar to the following:

  NAME          DISABLED   MODELTYPE     CONTAINERS         AGE
  kserve-ovms              openvino_ir   kserve-container   1m

- Create an InferenceService YAML resource, depending on whether the model is stored in a private or a public OCI repository:
  - For a model stored in a public OCI repository, create an InferenceService YAML file, replacing <user_name>, <repository_name>, and <tag_name> with values specific to your environment. An example is shown after this procedure.
  - For a model stored in a private OCI repository, create an InferenceService YAML file that specifies your pull secret in the spec.predictor.imagePullSecrets field.

  After you create the InferenceService resource, KServe deploys the model stored in the OCI image referred to by the storageUri field.
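The following is an illustrative InferenceService manifest for the public-repository case, using the kserve-ovms runtime created earlier in this procedure. The registry host is an example; replace <user_name>, <repository_name>, and <tag_name> with your own values, and adjust the model format if your model is not ONNX:

  apiVersion: serving.kserve.io/v1beta1
  kind: InferenceService
  metadata:
    name: sample-isvc-using-oci
  spec:
    predictor:
      model:
        runtime: kserve-ovms
        modelFormat:
          name: onnx
        storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>

For a model in a private OCI repository, add the pull secret to the predictor specification, for example:

  spec:
    predictor:
      imagePullSecrets:
      - name: <pull_secret_name>
      model:
        runtime: kserve-ovms
        modelFormat:
          name: onnx
        storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>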
Verification
- Check the status of the deployment:

  oc get inferenceservice

  The command should return output that includes information, such as the URL of the deployed model and its readiness state.
2.4. Deploying models by using Distributed Inference with llm-d
Distributed Inference with llm-d is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
Distributed Inference with llm-d is a Kubernetes-native, open-source framework designed for serving large language models (LLMs) at scale. You can use Distributed Inference with llm-d to simplify the deployment of generative AI, focusing on high performance and cost-effectiveness across various hardware accelerators.
Key features of Distributed Inference with llm-d include:
- Efficiently handles large models by using optimizations such as prefix-cache aware routing and disaggregated serving.
- Integrates into a standard Kubernetes environment, using specialized components such as the Envoy proxy to handle networking and routing, and high-performance libraries such as vLLM and the NVIDIA Inference Transfer Library (NIXL).
- Reduces the complexity of deploying inference at scale through tested recipes and well-known presets, so users can focus on building applications rather than managing infrastructure.
Serving models using Distributed Inference with llm-d on Red Hat OpenShift AI consists of the following steps:
- Installing OpenShift AI.
  Note: Because KServe Serverless conflicts with the Gateway API used for Distributed Inference with llm-d, KServe Serverless is not supported on the same cluster. Instead, use KServe RawDeployment.
- Enabling the single model serving platform.
- Enabling Distributed Inference with llm-d on a Kubernetes cluster.
- Creating an LLMInferenceService Custom Resource (CR).
- Deploying a model.
This procedure describes how to create an LLMInferenceService custom resource (CR). The LLMInferenceService resource replaces the default InferenceService resource.
Prerequisites
- You have enabled the single-model serving platform.
- You have access to an OpenShift cluster running version 4.19.9 or later.
- OpenShift Service Mesh v2 is not installed in the cluster.
- You have created a GatewayClass and a Gateway named openshift-ai-inference in the openshift-ingress namespace as described in Gateway API with OpenShift Container Platform Networking.
- You have installed the LeaderWorkerSet Operator in OpenShift. For more information, see the OpenShift documentation.
Procedure
- Log in to the OpenShift console as a cluster administrator.
- Create a data science cluster initialization (DSCI) and set serviceMesh.managementState to Removed, as shown in the following example:

  serviceMesh:
    ...
    managementState: Removed

- Create a data science cluster (DSC) with the required settings in the kserve and serving components. An example DSC is shown after this procedure.
- Create the LLMInferenceService CR. An example is shown after this procedure.
- Customize the following parameters in the spec section of the inference service:
  - replicas - Specify the number of replicas.
  - model - Provide the URI to the model based on how the model is stored (uri) and the model name to use in chat completion requests (name).
    - S3 bucket: s3://<bucket-name>/<object-key>
    - Persistent volume claim (PVC): pvc://<claim-name>/<pvc-path>
    - OCI container image: oci://<registry_host>/<org_or_username>/<repository_name><tag_or_digest>
    - Hugging Face: hf://<model>/<optional-hash>
  - router - Provide an HTTPRoute and gateway, or leave blank to automatically create one.
- Save the file.
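The following sketches show one possible shape for the DSC settings and the LLMInferenceService CR described in this procedure. They are illustrative examples based on the parameters listed above; the resource names, the model URI, and the exact field layout are assumptions that you should verify against the DataScienceCluster and LLMInferenceService schemas installed on your cluster.

Example DSC settings for Distributed Inference with llm-d (KServe in RawDeployment mode, Serverless removed):

  apiVersion: datasciencecluster.opendatahub.io/v1
  kind: DataScienceCluster
  metadata:
    name: default-dsc
  spec:
    components:
      kserve:
        managementState: Managed
        defaultDeploymentMode: RawDeployment    # KServe Serverless is not supported with llm-d
        serving:
          managementState: Removed

Example LLMInferenceService CR that serves a model from Hugging Face and requests an automatically created route and gateway:

  apiVersion: serving.kserve.io/v1alpha1
  kind: LLMInferenceService
  metadata:
    name: example-llm            # placeholder name
    namespace: example-project   # placeholder project
  spec:
    replicas: 1
    model:
      uri: hf://<model>/<optional-hash>              # or s3://, pvc://, oci:// as listed above
      name: <model-name-for-chat-completion-requests>
    router:
      route: {}
      gateway: {}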
2.4.1. Example usage for Distributed Inference with llm-d
These examples show how to use Distributed Inference with llm-d in common scenarios.
Distributed Inference with llm-d is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
2.4.1.1. Single-node GPU deployment
Use single-GPU-per-replica deployment patterns for development, testing, or production deployments of smaller models, such as 7-billion-parameter models.
You can use the following examples for single-node GPU deployments:
2.4.1.2. Multi-node deployment
You can use the following examples for multi-node deployments:
2.4.1.3. Intelligent inference scheduler with KV cache routing
You can configure the scheduler to track key-value (KV) cache blocks across inference endpoints and route requests to the endpoint with the highest cache hit rate. This configuration improves throughput and reduces latency by maximizing cache reuse.
For an example, see Precise Prefix KV Cache Routing.
2.5. Monitoring models on the single-model serving platform
You can monitor models that are deployed on the single-model serving platform to view performance and resource usage metrics.
2.5.1. Viewing performance metrics for a deployed model
You can monitor the following metrics for a specific model that is deployed on the single-model serving platform:
- Number of requests - The number of requests that have failed or succeeded for a specific model.
- Average response time (ms) - The average time it takes a specific model to respond to requests.
- CPU utilization (%) - The percentage of the CPU limit per model replica that is currently utilized by a specific model.
- Memory utilization (%) - The percentage of the memory limit per model replica that is utilized by a specific model.
You can specify a time range and a refresh interval for these metrics to help you determine, for example, when the peak usage hours are and how the model is performing at a specified time.
Prerequisites
- You have installed Red Hat OpenShift AI.
- A cluster admin has enabled user workload monitoring (UWM) for user-defined projects on your OpenShift cluster. For more information, see Enabling monitoring for user-defined projects and Configuring monitoring for the single-model serving platform.
- You have logged in to Red Hat OpenShift AI.
- The following dashboard configuration options are set to the default values as shown:

  disablePerformanceMetrics: false
  disableKServeMetrics: false

  For more information about setting dashboard configuration options, see Customizing the dashboard.
- You have deployed a model on the single-model serving platform by using a preinstalled runtime.
Note: Metrics are only supported for models deployed by using a preinstalled model-serving runtime or a custom runtime that is duplicated from a preinstalled runtime.
Procedure
- From the OpenShift AI dashboard navigation menu, click Data science projects.
  The Data science projects page opens.
- Click the name of the project that contains the data science models that you want to monitor.
- In the project details page, click the Models tab.
- Select the model that you are interested in.
- On the Endpoint performance tab, set the following options:
- Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
- Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.
- Scroll down to view data graphs for number of requests, average response time, CPU utilization, and memory utilization.
Verification
The Endpoint performance tab shows graphs of metrics for the model.
2.5.2. Viewing model-serving runtime metrics for the single-model serving platform
When a cluster administrator has configured monitoring for the single-model serving platform, non-admin users can use the OpenShift web console to view model-serving runtime metrics for the KServe component.
Prerequisites
- A cluster administrator has configured monitoring for the single-model serving platform.
- You have been assigned the monitoring-rules-view role. For more information, see Granting users permission to configure monitoring for user-defined projects.
- You are familiar with how to monitor project metrics in the OpenShift web console. For more information, see Monitoring your project metrics.
Procedure
- Log in to the OpenShift web console.
- Switch to the Developer perspective.
- In the left menu, click Observe.
- As described in Monitoring your project metrics, use the web console to run queries for model-serving runtime metrics. You can also run queries for metrics that are related to OpenShift Service Mesh. Some example queries follow.

  The following query displays the number of successful inference requests over a period of time for a model deployed with the vLLM runtime:

  sum(increase(vllm:request_success_total{namespace=${namespace},model_name=${model_name}}[${rate_interval}]))

  Note: Certain vLLM metrics are available only after an inference request is processed by a deployed model. To generate and view these metrics, you must first make an inference request to the model.

  The following query displays the number of successful inference requests over a period of time for a model deployed with the standalone TGIS runtime:

  sum(increase(tgi_request_success{namespace=${namespace}, pod=~${model_name}-predictor-.*}[${rate_interval}]))

  The following query displays the number of successful inference requests over a period of time for a model deployed with the Caikit Standalone runtime:

  sum(increase(predict_rpc_count_total{namespace=${namespace},code=OK,model_id=${model_name}}[${rate_interval}]))

  The following query displays the number of successful inference requests over a period of time for a model deployed with the OpenVINO Model Server runtime:

  sum(increase(ovms_requests_success{namespace=${namespace},name=${model_name}}[${rate_interval}]))
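Certain vLLM metrics, such as vllm:request_success_total, are populated only after the model processes a request. One way to generate a request is to call the OpenAI-compatible chat completions endpoint that vLLM-based runtimes expose; the host name, model name, and token requirements in the following example depend on your deployment:

  curl -ks https://<model_route_host>/v1/chat/completions \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"model": "<model_name>", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'

After the request completes, rerun the vLLM query to confirm that the counter increases.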