Chapter 1. About model-serving platforms


As an OpenShift AI administrator, you can enable your preferred serving platform and make it available for serving models. You can also add a custom or a tested and verified model-serving runtime.

1.1. About model serving

When you serve a model, you upload a trained model into Red Hat OpenShift AI for querying, which allows you to integrate your trained models into intelligent applications.

You can upload a model to S3-compatible object storage, a persistent volume claim, or an Open Container Initiative (OCI) image. You can then access and train the model from your project workbench. After training the model, you can serve or deploy the model using a model-serving platform.

Serving or deploying the model makes the model available as a service, or model runtime server, that you can access using an API. You can then access the inference endpoints for the deployed model from the dashboard and see predictions based on data inputs that you provide through API calls. Querying the model through the API is also called model inferencing.
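For a vLLM-backed deployment, the inference endpoint exposes an OpenAI-compatible completions API. The following sketch builds such a request body; the endpoint URL and model name are placeholders for illustration, not values from a real deployment.

```python
import json

# Hypothetical endpoint URL; replace with the inference endpoint shown
# on the dashboard for your deployed model.
ENDPOINT = "https://<inference-endpoint>/v1/completions"

def build_inference_request(model_name, prompt, max_tokens=64):
    """Serialize an OpenAI-compatible completion request body."""
    payload = {
        "model": model_name,       # the served model name
        "prompt": prompt,          # input text the model generates from
        "max_tokens": max_tokens,  # cap on the number of generated tokens
    }
    return json.dumps(payload)

body = build_inference_request("granite", "What is model serving?")
print(body)
```

You would then POST this body to the endpoint with an HTTP client such as curl or the Python requests library, including any authorization token your deployment requires.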

You can serve models on one of the following model-serving platforms:

  • Single-model serving platform
  • Multi-model serving platform
  • NVIDIA NIM model serving platform

The model-serving platform that you choose depends on your business needs:

  • If you want to deploy each model on its own runtime server, or want to use a serverless deployment, select the single-model serving platform. The single-model serving platform is recommended for production use.
  • If you want to deploy multiple models with only one runtime server, select the multi-model serving platform. This option is best if you are deploying more than 1,000 small and medium models and want to reduce resource consumption.
  • If you want to use NVIDIA Inference Microservices (NIM) to deploy a model, select the NVIDIA NIM model serving platform.

1.1.1. Single-model serving platform

You can deploy each model from a dedicated model server on the single-model serving platform. Deploying models from a dedicated model server can help you deploy, monitor, scale, and maintain models that require increased resources. This model serving platform is ideal for serving large models. The single-model serving platform is based on the KServe component.

The single-model serving platform is helpful for use cases such as:

  • Large language models (LLMs)
  • Generative AI

For more information about setting up the single-model serving platform, see Installing the single-model serving platform.

1.1.2. Multi-model serving platform

You can deploy multiple models from the same model server on the multi-model serving platform. Each of the deployed models shares the server resources. Deploying multiple models from the same model server can be advantageous on OpenShift clusters that have finite compute resources or pods. This model serving platform is ideal for serving small and medium models in large quantities. The multi-model serving platform is based on the ModelMesh component.

For more information about setting up the multi-model serving platform, see Installing the multi-model serving platform.

1.1.3. NVIDIA NIM model serving platform

You can deploy models using NVIDIA Inference Microservices (NIM) on the NVIDIA NIM model serving platform.

NVIDIA NIM, part of NVIDIA AI Enterprise, is a set of microservices designed for secure, reliable deployment of high performance AI model inferencing across clouds, data centers and workstations.

NVIDIA NIM inference services are helpful for use cases such as:

  • Using GPU-accelerated containers to run inference on models optimized by NVIDIA
  • Deploying generative AI for virtual screening, content generation, and avatar creation

The NVIDIA NIM model serving platform is based on the single-model serving platform. To use the NVIDIA NIM model serving platform, you must first install the single-model serving platform.

For more information, see Installing the single-model serving platform.

1.2. Model-serving runtimes

You can serve models on the single-model serving platform by using model-serving runtimes. The configuration of a model-serving runtime is defined by the ServingRuntime and InferenceService custom resource definitions (CRDs).

1.2.1. ServingRuntime

The ServingRuntime CRD creates a serving runtime: an environment for deploying and managing a model. It defines the templates for pods that dynamically load and unload models of various formats, and it exposes a service endpoint for inferencing requests.

The following YAML configuration is an example of the vLLM ServingRuntime for KServe model-serving runtime. The configuration includes various flags, environment variables, and command-line arguments.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]' # (1)
    openshift.io/display-name: vLLM ServingRuntime for KServe # (2)
  labels:
    opendatahub.io/dashboard: "true"
  name: vllm-runtime
  namespace: <namespace>
spec:
  annotations:
    prometheus.io/path: /metrics # (3)
    prometheus.io/port: "8080" # (4)
  containers:
    - args:
        - --port=8080
        - --model=/mnt/models # (5)
        - --served-model-name={{.Name}} # (6)
      command: # (7)
        - python
        - '-m'
        - vllm.entrypoints.openai.api_server
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      image: quay.io/modh/vllm@sha256:8a3dd8ad6e15fe7b8e5e471037519719d4d8ad3db9d69389f2beded36a6f5b21 # (8)
      name: kserve-container
      ports:
        - containerPort: 8080
          protocol: TCP
  multiModel: false # (9)
  supportedModelFormats: # (10)
    - autoSelect: true
      name: vLLM

(1) The recommended accelerator to use with the runtime.
(2) The name with which the serving runtime is displayed.
(3) The endpoint used by Prometheus to scrape metrics for monitoring.
(4) The port used by Prometheus to scrape metrics for monitoring.
(5) The path to where the model files are stored in the runtime container.
(6) Passes the model name that is specified by the {{.Name}} template variable inside the runtime container specification to the runtime environment. The {{.Name}} variable maps to the spec.predictor.name field in the InferenceService metadata object.
(7) The entrypoint command that starts the runtime container.
(8) The runtime container image used by the serving runtime. This image differs depending on the type of accelerator used.
(9) Specifies that the runtime is used for single-model serving.
(10) Specifies the model formats supported by the runtime.

1.2.2. InferenceService

The InferenceService CRD creates a server or inference service that processes inference queries, passes them to the model, and then returns the inference output.

The inference service also performs the following actions:

  • Specifies the location and format of the model.
  • Specifies the serving runtime used to serve the model.
  • Enables the passthrough route for gRPC or REST inference.
  • Defines HTTP or gRPC endpoints for the deployed model.

The following example shows the InferenceService YAML configuration file that is generated when deploying a granite model with the vLLM runtime:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: granite
    serving.knative.openshift.io/enablePassthrough: 'true'
    sidecar.istio.io/inject: 'true'
    sidecar.istio.io/rewriteAppHTTPProbers: 'true'
  name: granite
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '6'
          memory: 24Gi
          nvidia.com/gpu: '1'
        requests:
          cpu: '1'
          memory: 8Gi
          nvidia.com/gpu: '1'
      runtime: vllm-runtime
      storage:
        key: aws-connection-my-storage
        path: models/granite-7b-instruct/
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
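Note that in the resources block above, the nvidia.com/gpu request equals its limit. Kubernetes treats GPUs as extended resources, which cannot be overcommitted: a container must request exactly as many GPUs as it limits. The following sketch checks that constraint for a resources block like the one in the example; the function name is illustrative.

```python
def check_gpu_resources(resources, gpu_key="nvidia.com/gpu"):
    """Return True when the GPU request matches the GPU limit, as
    Kubernetes requires for extended resources."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    return requests.get(gpu_key) == limits.get(gpu_key)

# Resources block mirroring the InferenceService example above.
resources = {
    "limits": {"cpu": "6", "memory": "24Gi", "nvidia.com/gpu": "1"},
    "requests": {"cpu": "1", "memory": "8Gi", "nvidia.com/gpu": "1"},
}
print(check_gpu_resources(resources))  # True
```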

1.2.3. Model-serving runtimes for accelerators

OpenShift AI provides support for accelerators through preinstalled model-serving runtimes.

1.2.3.1. NVIDIA GPUs

You can serve models with NVIDIA graphics processing units (GPUs) by using the vLLM NVIDIA GPU ServingRuntime for KServe runtime. To use the runtime, you must enable GPU support in OpenShift AI. This includes installing and configuring the Node Feature Discovery Operator on your cluster. For more information, see Installing the Node Feature Discovery Operator and Enabling NVIDIA GPUs.

1.2.3.2. Intel Gaudi accelerators

You can serve models with Intel Gaudi accelerators by using the vLLM Intel Gaudi Accelerator ServingRuntime for KServe runtime. To use the runtime, you must enable Habana Processing Unit (HPU) support in OpenShift AI. This includes installing the Intel Gaudi Base Operator and configuring a hardware profile. For more information, see Intel Gaudi Base Operator OpenShift installation and Working with hardware profiles.

For information about recommended vLLM parameters, environment variables, supported configurations and more, see vLLM with Intel® Gaudi® AI Accelerators.

Note

Warm-up is a model initialization and performance optimization step that is useful for reducing cold-start delays and first-inference latency. Depending on the model size, warm-up can lead to longer model loading times.

Although warm-up is highly recommended in production environments to avoid performance limitations, you can skip it in non-production environments to reduce model loading times and accelerate model development and testing cycles. To skip warm-up, follow the steps described in Customizing the parameters of a deployed model-serving runtime to add the following environment variable in the Configuration parameters section of your model deployment:

`VLLM_SKIP_WARMUP="true"`

1.2.3.3. AMD GPUs

You can serve models with AMD graphics processing units (GPUs) by using the vLLM AMD GPU ServingRuntime for KServe runtime. To use the runtime, you must enable AMD GPU support in OpenShift AI. This includes installing the AMD GPU operator and configuring a hardware profile. For more information, see Deploying the AMD GPU operator on OpenShift in the AMD documentation and Working with hardware profiles.

1.2.3.4. IBM Spyre AI accelerators on x86

Important

Support for IBM Spyre AI Accelerators on x86 is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

You can serve models with IBM Spyre AI accelerators on x86 by using the vLLM Spyre AI Accelerator ServingRuntime for KServe runtime. To use the runtime, you must install the Spyre Operator and configure a hardware profile. For more information, see Spyre operator image and Working with hardware profiles.

1.2.4. Supported model-serving runtimes

OpenShift AI includes several preinstalled model-serving runtimes. You can use preinstalled model-serving runtimes to start serving models without modifying or defining the runtime yourself. You can also add a custom runtime to support a model.

See Supported configurations for a list of the supported model-serving runtimes and deployment requirements.

For help adding a custom runtime, see Adding a custom model-serving runtime for the single-model serving platform.

1.2.5. Tested and verified model-serving runtimes

Tested and verified runtimes are community versions of model-serving runtimes that have been tested and verified against specific versions of OpenShift AI.

Red Hat tests the current version of a tested and verified runtime each time there is a new version of OpenShift AI. If a new version of a tested and verified runtime is released in the middle of an OpenShift AI release cycle, it will be tested and verified in an upcoming release.

See Supported configurations for a list of tested and verified runtimes in OpenShift AI.

Note

Tested and verified runtimes are not directly supported by Red Hat. You are responsible for ensuring that you are licensed to use any tested and verified runtimes that you add, and for correctly configuring and maintaining them.

For more information, see Tested and verified runtimes in OpenShift AI.
