Chapter 2. Deploying models on the single-model serving platform


The single-model serving platform deploys each model from its own dedicated model server. This architecture is ideal for deploying, monitoring, scaling, and maintaining large models that require more resources, such as large language models (LLMs).

The platform is based on the KServe component and offers two deployment modes:

  • KServe RawDeployment: Uses a standard deployment method that does not require serverless dependencies.
  • Knative Serverless: Uses Red Hat OpenShift Serverless for deployments that can automatically scale based on demand.

2.1. About KServe deployment modes

KServe offers two deployment modes for serving models. The default mode, Knative Serverless, is based on the open-source Knative project and provides powerful autoscaling capabilities. It integrates with Red Hat OpenShift Serverless and Red Hat OpenShift Service Mesh. Alternatively, the KServe RawDeployment mode offers a more traditional deployment method with fewer dependencies.

Before you choose an option, understand how your initial configuration affects future deployments:

  • If you configure for Knative Serverless: You can use both Knative Serverless and KServe RawDeployment modes.
  • If you configure for KServe RawDeployment only: You can only use the KServe RawDeployment mode.
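
If your cluster is configured for Knative Serverless, you can still select KServe RawDeployment for an individual model outside the dashboard by annotating its InferenceService with serving.kserve.io/deploymentMode. The following sketch is illustrative only; the resource name, runtime, and storage URI are placeholders rather than values from this guide:

  apiVersion: serving.kserve.io/v1beta1
  kind: InferenceService
  metadata:
    name: example-raw-isvc                              # hypothetical name
    annotations:
      serving.kserve.io/deploymentMode: RawDeployment   # omit the annotation to use the default mode
  spec:
    predictor:
      model:
        runtime: kserve-ovms                            # any ServingRuntime enabled in your project
        modelFormat:
          name: onnx
        storageUri: s3://<bucket_name>/<model_path>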

Use the following comparison to choose the option that best fits your requirements.

Table 2.1. Comparison of deployment modes

Default mode
  • Knative Serverless: Yes
  • KServe RawDeployment: No

Recommended use case
  • Knative Serverless: Most workloads.
  • KServe RawDeployment: Custom serving setups or models that must remain active.

Autoscaling
  • Knative Serverless: Scales up automatically based on request volume and supports scaling down to zero when idle to save costs.
  • KServe RawDeployment: No built-in autoscaling; you can configure Kubernetes Event-Driven Autoscaling (KEDA) or the Horizontal Pod Autoscaler (HPA) on your deployment. Does not support scaling to zero by default, which might result in higher costs during periods of low traffic.

Dependencies
  • Knative Serverless: Red Hat OpenShift Serverless Operator, Red Hat OpenShift Service Mesh, and Authorino (required only if you enable token authentication and external routes).
  • KServe RawDeployment: None; uses standard Kubernetes resources such as Deployment, Service, and Horizontal Pod Autoscaler.

Configuration flexibility
  • Knative Serverless: Has some customization limitations inherited from Knative compared to raw Kubernetes deployments.
  • KServe RawDeployment: Provides full control over pod specifications because it uses standard Kubernetes Deployment resources.

Resource footprint
  • Knative Serverless: Larger, due to the additional dependencies required for serverless functionality.
  • KServe RawDeployment: Smaller.

Setup complexity
  • Knative Serverless: Might require additional configuration and management. If Serverless is not already installed on the cluster, you must install and configure it.
  • KServe RawDeployment: Simpler setup with fewer dependencies.
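
For KServe RawDeployment, a HorizontalPodAutoscaler is one way to add the autoscaling noted in the table. The following sketch is illustrative only; the Deployment name is an assumption based on the typical <inference_service_name>-predictor naming, so confirm the actual name with oc get deployments before you apply it:

  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: example-model-hpa              # hypothetical name
    namespace: <project_name>
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: example-model-predictor      # assumed predictor Deployment created for the model
    minReplicas: 1
    maxReplicas: 4
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80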

When you have enabled the single-model serving platform, you can enable a preinstalled or custom model-serving runtime and deploy models on the platform.

You can use preinstalled model-serving runtimes to start serving models without modifying or defining the runtime yourself. For help adding a custom runtime, see Adding a custom model-serving runtime for the single-model serving platform.

To successfully deploy a model, you must meet the following prerequisites.

General prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • You have installed KServe and enabled the single-model serving platform.
  • (Knative Serverless deployments only) To enable token authentication and external model routes for deployed models, you have added Authorino as an authorization provider. For more information, see Adding an authorization provider for the single-model serving platform.
  • You have created a data science project.
  • You have access to S3-compatible object storage, a URI-based repository, an OCI-compliant registry, or a persistent volume claim (PVC), and you have added a connection to your data science project. For more information about adding a connection, see Adding a connection to your data science project. An example connection resource is shown after this list.
  • If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in OpenShift AI. If you use NVIDIA GPUs, see Enabling NVIDIA GPUs. If you use AMD GPUs, see AMD GPU integration.
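
The following sketch shows what an S3-compatible connection can look like when expressed as a Kubernetes Secret. The label, annotation, and key names are assumptions based on the conventions that the OpenShift AI dashboard uses for S3 connections; prefer adding the connection from the dashboard as described in the linked documentation:

  apiVersion: v1
  kind: Secret
  metadata:
    name: my-s3-connection                    # hypothetical name
    namespace: <project_name>                 # your data science project
    labels:
      opendatahub.io/dashboard: "true"        # assumed label that makes the connection visible in the dashboard
    annotations:
      opendatahub.io/connection-type: s3      # assumed connection-type annotation
  type: Opaque
  stringData:
    AWS_ACCESS_KEY_ID: <access_key>
    AWS_SECRET_ACCESS_KEY: <secret_key>
    AWS_S3_ENDPOINT: <endpoint_url>
    AWS_S3_BUCKET: <bucket_name>
    AWS_DEFAULT_REGION: <region>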

Runtime-specific prerequisites

Meet the requirements for the specific runtime you intend to use.

  • Caikit-TGIS runtime

  • vLLM NVIDIA GPU ServingRuntime for KServe

  • vLLM CPU ServingRuntime for KServe

  • vLLM Intel Gaudi Accelerator ServingRuntime for KServe

  • vLLM AMD GPU ServingRuntime for KServe

  • vLLM Spyre AI Accelerator ServingRuntime for KServe

    Important

    Support for IBM Spyre AI Accelerators on x86 is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

    For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Procedure

  1. In the left menu, click Data science projects.
  2. Click the name of the project that you want to deploy a model in.

    A project details page opens.

  3. Click the Models tab.
  4. Click Select single-model to deploy your model using single-model serving.
  5. Click the Deploy model button.

    The Deploy model dialog opens.

  6. In the Model deployment name field, enter a unique name for the model that you are deploying.
  7. In the Serving runtime field, select an enabled runtime. If project-scoped runtimes exist, the Serving runtime list includes subheadings to distinguish between global runtimes and project-scoped runtimes.
  8. From the Model framework (name - version) list, select a value if applicable.
  9. From the Deployment mode list, select KServe RawDeployment or Knative Serverless. For more information about deployment modes, see About KServe deployment modes.
  10. In the Number of model server replicas to deploy field, specify a value.
  11. The following options are only available if you have created a hardware profile:

    1. From the Hardware profile list, select a hardware profile. If project-scoped hardware profiles exist, the Hardware profile list includes subheadings to distinguish between global hardware profiles and project-scoped hardware profiles.

      Important

      By default, hardware profiles are hidden in the dashboard navigation menu and user interface, while accelerator profiles remain visible. In addition, user interface components associated with the deprecated accelerator profiles functionality are still displayed. If you enable hardware profiles, the Hardware profiles list is displayed instead of the Accelerator profiles list. To show the Settings → Hardware profiles option in the dashboard navigation menu, and the user interface components associated with hardware profiles, set the disableHardwareProfiles value to false in the OdhDashboardConfig custom resource (CR) in OpenShift. For more information about setting dashboard configuration options, see Customizing the dashboard.

    2. Optional: The hardware profile specifies the number of CPUs and the amount of memory allocated to the container, setting the guaranteed minimum (request) and maximum (limit) for both. To change these default values, click Customize resource requests and limit and enter new minimum (request) and maximum (limit) values.
  12. Optional: In the Model route section, select the Make deployed models available through an external route checkbox to make your deployed models available to external clients.
  13. To require token authentication for inference requests to the deployed model, perform the following actions:

    1. Select Require token authentication.
    2. In the Service account name field, enter the service account name that the token will be generated for.
    3. To add an additional service account, click Add a service account and enter another service account name.
  14. To specify the location of your model, select a Connection type that you have added. The OCI-compliant registry, S3-compatible object storage, and URI options are preinstalled connection types. Additional options might be available if your OpenShift AI administrator added them.

    1. For S3-compatible object storage: In the Path field, enter the folder path that contains the model in your specified data source.

      Important

      The OpenVINO Model Server runtime has specific requirements for how you specify the model path. For more information, see known issue RHOAIENG-3025 in the OpenShift AI release notes.

    2. For Open Container Image connections: In the OCI storage location field, enter the model URI where the model is located.

      Note

      If you are deploying a registered model version with an existing S3, URI, or OCI data connection, some of your connection details might be autofilled. This depends on the type of data connection and the number of matching connections available in your data science project. For example, if only one matching connection exists, fields like the path, URI, endpoint, model URI, bucket, and region might populate automatically. Matching connections will be labeled as Recommended.

    3. Complete the connection detail fields.
    4. Optional: If you have uploaded model files to a persistent volume claim (PVC) and the PVC is attached to your workbench, use the Existing cluster storage option to select the PVC and specify the path to the model file.

      Important

      If your connection type is an S3-compatible object storage, you must provide the folder path that contains your data file. The OpenVINO Model Server runtime has specific requirements for how you specify the model path. For more information, see known issue RHOAIENG-3025 in the OpenShift AI release notes.

  15. Optional: Customize the runtime parameters in the Configuration parameters section:

    1. Modify the values in Additional serving runtime arguments to define how the deployed model behaves.
    2. Modify the values in Additional environment variables to define variables in the model’s environment.

      The Configuration parameters section shows predefined serving runtime parameters, if any are available.

      Note

      Do not modify the port or model serving runtime arguments, because they require specific values to be set. Overwriting these parameters can cause the deployment to fail.
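
      As an illustration only, with a vLLM-based runtime the Additional serving runtime arguments field accepts one argument per line, for example (the values are hypothetical and depend on your model and hardware):

        --max-model-len=4096
        --dtype=half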

  16. Click Deploy.

Verification

  • Confirm that the deployed model is shown on the Models tab for the project, and on the Model deployments page of the dashboard with a checkmark in the Status column.
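  • Optional: You can also confirm the deployment from the command line, for example (the project name is a placeholder):

    oc get inferenceservice -n <project_name>

    The READY column reports True when the model is available.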

You can deploy a model that is stored in an OCI image from the command line interface.

The following procedure uses the example of deploying a MobileNet v2-7 model in ONNX format, stored in an OCI image on an OpenVINO model server.

Note

By default in KServe, models are exposed outside the cluster and not protected with authentication.

Prerequisites

  • You have stored a model in an OCI image as described in Storing a model in an OCI image.
  • If you want to deploy a model that is stored in a private OCI repository, you must configure an image pull secret. For more information about creating an image pull secret, see Using image pull secrets.
  • You are logged in to your OpenShift cluster.

Procedure

  1. Create a project to deploy the model:

    oc new-project oci-model-example
  2. Use the kserve-ovms template in the OpenShift AI applications project (redhat-ods-applications) to create a ServingRuntime resource and configure the OpenVINO model server in the new project:

    oc process -n redhat-ods-applications -o yaml kserve-ovms | oc apply -f -
  3. Verify that the ServingRuntime named kserve-ovms is created:

    oc get servingruntimes

    The command should return output similar to the following:

    NAME          DISABLED   MODELTYPE     CONTAINERS         AGE
    kserve-ovms              openvino_ir   kserve-container   1m
  4. Create an InferenceService YAML resource, depending on whether the model is stored in a private or a public OCI repository:

    • For a model stored in a public OCI repository, create an InferenceService YAML file with the following values, replacing <user_name>, <repository_name>, and <tag_name> with values specific to your environment:

      apiVersion: serving.kserve.io/v1beta1
      kind: InferenceService
      metadata:
        name: sample-isvc-using-oci
      spec:
        predictor:
          model:
            runtime: kserve-ovms # Ensure this matches the name of the ServingRuntime resource
            modelFormat:
              name: onnx
            storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>
            resources:
              requests:
                memory: 500Mi
                cpu: 100m
                # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
              limits:
                memory: 4Gi
                cpu: 500m
                # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
    • For a model stored in a private OCI repository, create an InferenceService YAML file that specifies your pull secret in the spec.predictor.imagePullSecrets field, as shown in the following example:

      apiVersion: serving.kserve.io/v1beta1
      kind: InferenceService
      metadata:
        name: sample-isvc-using-private-oci
      spec:
        predictor:
          model:
            runtime: kserve-ovms # Ensure this matches the name of the ServingRuntime resource
            modelFormat:
              name: onnx
            storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>
            resources:
              requests:
                memory: 500Mi
                cpu: 100m
                # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
              limits:
                memory: 4Gi
                cpu: 500m
                # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
          imagePullSecrets: # Specify image pull secrets to use for fetching container images, including OCI model images
          - name: <pull-secret-name>

      After you create the InferenceService resource, KServe deploys the model stored in the OCI image referred to by the storageUri field.
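
      For example, to create the resource from a saved file (the file name is a placeholder):

        oc apply -n oci-model-example -f inference-service.yaml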

Verification

Check the status of the deployment:

oc get inferenceservice

The command returns output that includes information such as the URL of the deployed model and its readiness state.
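
For example, output similar to the following indicates a ready deployment (the URL, domain, and age are illustrative and the columns can vary by KServe version):

  NAME                    URL                                                        READY   AGE
  sample-isvc-using-oci   https://sample-isvc-using-oci-oci-model-example.<domain>   True    2m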

Important

Distributed Inference with llm-d is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Distributed Inference with llm-d is a Kubernetes-native, open-source framework designed for serving large language models (LLMs) at scale. You can use Distributed Inference with llm-d to simplify the deployment of generative AI, focusing on high performance and cost-effectiveness across various hardware accelerators.

Key features of Distributed Inference with llm-d include:

  • Efficiently handles large models using optimizations such as prefix-cache aware routing and disaggregated serving.
  • Integrates into a standard Kubernetes environment, where it leverages specialized components like the Envoy proxy to handle networking and routing, and high-performance libraries such as vLLM and NVIDIA Inference Transfer Library (NIXL).
  • Reduces the complexity of deploying inference at scale by providing tested recipes and well-known presets, so users can focus on building applications rather than managing infrastructure.

Serving models using Distributed Inference with llm-d on Red Hat OpenShift AI consists of the following steps:

  1. Installing OpenShift AI.

    Note

    Because KServe Serverless conflicts with the Gateway API used for Distributed Inference with llm-d, KServe Serverless is not supported on the same cluster. Instead, use KServe RawDeployment.

  2. Enabling the single model serving platform.
  3. Enabling Distributed Inference with llm-d on a Kubernetes cluster.
  4. Creating an LLMInferenceService Custom Resource (CR).
  5. Deploying a model.

This procedure describes how to create an LLMInferenceService custom resource (CR). For Distributed Inference with llm-d, the LLMInferenceService resource replaces the default InferenceService resource.

Prerequisites

  • You have enabled the single-model serving platform.
  • You have access to an OpenShift cluster running version 4.19.9 or later.
  • OpenShift Service Mesh v2 is not installed in the cluster.
  • You have created a GatewayClass and a Gateway named openshift-ai-inference in the openshift-ingress namespace as described in Gateway API with OpenShift Container Platform Networking. A minimal Gateway sketch is shown after this list.
  • You have installed the LeaderWorkerSet Operator in OpenShift. For more information, see the OpenShift documentation.
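
The following Gateway sketch illustrates the shape of that prerequisite. The GatewayClass name and listener configuration are assumptions; follow the linked Gateway API documentation for the supported configuration:

  apiVersion: gateway.networking.k8s.io/v1
  kind: Gateway
  metadata:
    name: openshift-ai-inference
    namespace: openshift-ingress
  spec:
    gatewayClassName: <gateway_class_name>   # the GatewayClass that you created
    listeners:
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
        - name: <tls_secret_name>            # assumed TLS certificate Secret
      allowedRoutes:
        namespaces:
          from: All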

Procedure

  1. Log in to the OpenShift console as a cluster administrator.
  2. Create a data science cluster initialization (DSCI) object and set serviceMesh.managementState to Removed, as shown in the following example:

    serviceMesh:
      ...
      managementState: Removed
  3. Create a data science cluster (DSC) with the following information set in kserve and serving:

    kserve:
      defaultDeploymentMode: RawDeployment
      managementState: Managed
      ...
      serving:
        ...
        managementState: Removed
        ...
  4. Create the LLMInferenceService CR with the following information:

    apiVersion: serving.kserve.io/v1alpha1
    kind: LLMInferenceService
    metadata:
      name: sample-llm-inference-service
    spec:
      replicas: 2
      model:
        uri: hf://RedHatAI/Qwen3-8B-FP8-dynamic
        name: RedHatAI/Qwen3-8B-FP8-dynamic
      router:
        route: {}
        gateway: {}
        scheduler: {}
        template:
          containers:
          - name: main
            resources:
              limits:
                cpu: '4'
                memory: 32Gi
                nvidia.com/gpu: "1"
              requests:
                cpu: '2'
                memory: 16Gi
                nvidia.com/gpu: "1"

    Customize the following parameters in the spec section of the inference service:

    • replicas - Specify the number of replicas.
    • model - Provide the URI to the model based on how the model is stored (uri) and the model name to use in chat completion requests (name).

      • S3 bucket: s3://<bucket-name>/<object-key>
      • Persistent volume claim (PVC): pvc://<claim-name>/<pvc-path>
      • OCI container image: oci://<registry_host>/<org_or_username>/<repository_name>:<tag_or_digest>
      • HuggingFace: hf://<model>/<optional-hash>
    • router - Provide an HTTPRoute and gateway, or leave blank to automatically create one.
  5. Save the file.
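
    To create the resource, apply the saved file to your project, for example (the file name is a placeholder):

      oc apply -n <project_name> -f llm-inference-service.yaml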

These examples show how to use Distributed Inference with llm-d in common scenarios.

Important

Distributed Inference with llm-d is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

2.4.1.1. Single-node GPU deployment

Use single-GPU-per-replica deployment patterns for development, testing, or production deployments of smaller models, such as 7-billion-parameter models.

You can use the following examples for single-node GPU deployments:

2.4.1.2. Multi-node deployment

You can use the following examples for multi-node deployments:

You can configure the scheduler to track key-value (KV) cache blocks across inference endpoints and route requests to the endpoint with the highest cache hit rate. This configuration improves throughput and reduces latency by maximizing cache reuse.

For an example, see Precise Prefix KV Cache Routing.

You can monitor models that are deployed on the single-model serving platform to view performance and resource usage metrics.

You can monitor the following metrics for a specific model that is deployed on the single-model serving platform:

  • Number of requests - The number of requests that have failed or succeeded for a specific model.
  • Average response time (ms) - The average time it takes a specific model to respond to requests.
  • CPU utilization (%) - The percentage of the CPU limit per model replica that is currently utilized by a specific model.
  • Memory utilization (%) - The percentage of the memory limit per model replica that is utilized by a specific model.

You can specify a time range and a refresh interval for these metrics to help you determine, for example, when the peak usage hours are and how the model is performing at a specified time.

Prerequisites

  • You have installed Red Hat OpenShift AI.
  • A cluster admin has enabled user workload monitoring (UWM) for user-defined projects on your OpenShift cluster. For more information, see Enabling monitoring for user-defined projects and Configuring monitoring for the single-model serving platform.
  • You have logged in to Red Hat OpenShift AI.
  • The following dashboard configuration options are set to the default values as shown:

    disablePerformanceMetrics: false
    disableKServeMetrics: false

    For more information about setting dashboard configuration options, see Customizing the dashboard. A sketch of the corresponding OdhDashboardConfig excerpt is shown after this list.

  • You have deployed a model on the single-model serving platform by using a preinstalled runtime.

    Note

    Metrics are only supported for models deployed by using a preinstalled model-serving runtime or a custom runtime that is duplicated from a preinstalled runtime.
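
The following excerpt sketches where these flags typically appear in the OdhDashboardConfig custom resource. The spec.dashboardConfig path is an assumption based on the option names above; verify the field paths in your cluster before you edit the CR:

  spec:
    dashboardConfig:
      disablePerformanceMetrics: false
      disableKServeMetrics: false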

Procedure

  1. From the OpenShift AI dashboard navigation menu, click Data science projects.

    The Data science projects page opens.

  2. Click the name of the project that contains the data science models that you want to monitor.
  3. In the project details page, click the Models tab.
  4. Select the model that you are interested in.
  5. On the Endpoint performance tab, set the following options:

    • Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
    • Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.
  6. Scroll down to view data graphs for number of requests, average response time, CPU utilization, and memory utilization.

Verification

The Endpoint performance tab shows graphs of metrics for the model.

When a cluster administrator has configured monitoring for the single-model serving platform, non-admin users can use the OpenShift web console to view model-serving runtime metrics for the KServe component.

Prerequisites

  • A cluster administrator has configured monitoring for the single-model serving platform. For more information, see Configuring monitoring for the single-model serving platform.

Procedure

  1. Log in to the OpenShift web console.
  2. Switch to the Developer perspective.
  3. In the left menu, click Observe.
  4. As described in Monitoring your project metrics, use the web console to run queries for model-serving runtime metrics. You can also run queries for metrics that are related to OpenShift Service Mesh. Some examples are shown.

    1. The following query displays the number of successful inference requests over a period of time for a model deployed with the vLLM runtime:

      sum(increase(vllm:request_success_total{namespace=${namespace},model_name=${model_name}}[${rate_interval}]))
      Note

      Certain vLLM metrics are available only after an inference request is processed by a deployed model. To generate and view these metrics, you must first make an inference request to the model.

    2. The following query displays the number of successful inference requests over a period of time for a model deployed with the standalone TGIS runtime:

      sum(increase(tgi_request_success{namespace=${namespace}, pod=~${model_name}-predictor-.*}[${rate_interval}]))
    3. The following query displays the number of successful inference requests over a period of time for a model deployed with the Caikit Standalone runtime:

      sum(increase(predict_rpc_count_total{namespace=${namespace},code=OK,model_id=${model_name}}[${rate_interval}]))
    4. The following query displays the number of successful inference requests over a period of time for a model deployed with the OpenVINO Model Server runtime:

      sum(increase(ovms_requests_success{namespace=${namespace},name=${model_name}}[${rate_interval}]))