Chapter 3. Serving large models

For deploying large models such as large language models (LLMs), Red Hat OpenShift AI includes a single-model serving platform that is based on the KServe component. Because each model is deployed on its own model server, the single-model serving platform helps you to deploy, monitor, scale, and maintain large models that require increased resources.

3.1. About the single-model serving platform

For deploying large models such as large language models (LLMs), OpenShift AI includes a single-model serving platform that is based on the KServe component. Because each model is deployed on its own model server, the single-model serving platform helps you to deploy, monitor, scale, and maintain large models that require increased resources.

3.1.1. Components

  • KServe: A Kubernetes custom resource definition (CRD) that orchestrates model serving for all types of models. KServe includes model-serving runtimes that implement the loading of given types of model servers. KServe also handles the lifecycle of the deployment object, storage access, and networking setup.
  • Red Hat OpenShift Serverless: A cloud-native development model that allows for serverless deployments of models. OpenShift Serverless is based on the open source Knative project.
  • Red Hat OpenShift Service Mesh: A service mesh networking layer that manages traffic flows and enforces access policies. OpenShift Service Mesh is based on the open source Istio project.

3.1.2. Installation options

To install the single-model serving platform, you have the following options:

Automated installation

If you have not already created a ServiceMeshControlPlane or KNativeServing resource on your OpenShift cluster, you can configure the Red Hat OpenShift AI Operator to install KServe and configure its dependencies.

For more information about automated installation, see Configuring automated installation of KServe.
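
If you use automated installation, the Red Hat OpenShift AI Operator manages KServe through the DataScienceCluster resource. The following is a minimal sketch only; the resource name is a placeholder, and you should verify the field names against the DataScienceCluster custom resource definition that is installed on your cluster.

apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    kserve:
      # Managed: the Operator installs KServe and configures its dependencies
      managementState: Managed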

Manual installation

If you have already created a ServiceMeshControlPlane or KNativeServing resource on your OpenShift cluster, you cannot configure the Red Hat OpenShift AI Operator to install KServe and configure its dependencies. In this situation, you must install KServe manually.

For more information about manual installation, see Manually installing KServe.

3.1.3. Authorization

You can add Authorino as an authorization provider for the single-model serving platform. Adding an authorization provider allows you to enable token authorization for models that you deploy on the platform, which ensures that only authorized parties can make inference requests to the models.

To add Authorino as an authorization provider on the single-model serving platform, you have the following options:

  • If automated installation of the single-model serving platform is possible on your cluster, you can include Authorino as part of the automated installation process.
  • If you need to manually install the single-model serving platform, you must also manually configure Authorino.

For guidance on choosing an installation option for the single-model serving platform, see Installation options.

3.1.4. Monitoring

You can configure monitoring for the single-model serving platform and use Prometheus to scrape metrics for each of the pre-installed model-serving runtimes.
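
Metrics from the model-serving runtimes are collected through OpenShift user workload monitoring. As a minimal sketch, one typical prerequisite is enabling user workload monitoring in the cluster monitoring ConfigMap; confirm the exact steps in the monitoring configuration documentation for your cluster before applying changes.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Enables the Prometheus instance that scrapes metrics from user-defined projects,
    # including projects that contain deployed models
    enableUserWorkload: true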

3.1.5. Supported model-serving runtimes

OpenShift AI includes several preinstalled model-serving runtimes. You can use preinstalled model-serving runtimes to start serving models without modifying or defining the runtime yourself. You can also add a custom runtime to support a model.

For help adding a custom runtime, see Adding a custom model-serving runtime for the single-model serving platform.

Table 3.1. Model-serving runtimes

| Name | Description | Exported model format |
| --- | --- | --- |
| Caikit Text Generation Inference Server (Caikit-TGIS) ServingRuntime for KServe (1) | A composite runtime for serving models in the Caikit format | Caikit Text Generation |
| Caikit Standalone ServingRuntime for KServe (2) | A runtime for serving models in the Caikit embeddings format for embeddings tasks | Caikit Embeddings |
| OpenVINO Model Server | A scalable, high-performance runtime for serving models that are optimized for Intel architectures | PyTorch, TensorFlow, OpenVINO IR, PaddlePaddle, MXNet, Caffe, Kaldi |
| Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe (3) | A runtime for serving TGI-enabled models | PyTorch Model Formats |
| vLLM ServingRuntime for KServe | A high-throughput and memory-efficient inference and serving runtime for large language models | Supported models |

  1. The composite Caikit-TGIS runtime is based on Caikit and Text Generation Inference Server (TGIS). To use this runtime, you must convert your models to Caikit format. For an example, see Converting Hugging Face Hub models to Caikit format in the caikit-tgis-serving repository.
  2. The Caikit Standalone runtime is based on Caikit NLP. To use this runtime, you must convert your models to the Caikit embeddings format. For an example, see Tests for text embedding module.
  3. Text Generation Inference Server (TGIS) is based on an early fork of Hugging Face TGI. Red Hat will continue to develop the standalone TGIS runtime to support TGI models. If a model is incompatible in the current version of OpenShift AI, support might be added in a future version. In the meantime, you can also add your own custom runtime to support a TGI model. For more information, see Adding a custom model-serving runtime for the single-model serving platform.
Table 3.2. Deployment requirements

| Name | Default protocol | Additional protocol | Model mesh support | Single-node OpenShift support | Deployment mode |
| --- | --- | --- | --- | --- | --- |
| Caikit Text Generation Inference Server (Caikit-TGIS) ServingRuntime for KServe | REST | gRPC | No | Yes | Raw and serverless |
| Caikit Standalone ServingRuntime for KServe | REST | gRPC | No | Yes | Raw and serverless |
| OpenVINO Model Server | REST | None | Yes | Yes | Raw and serverless |
| Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe | gRPC | None | No | Yes | Raw and serverless |
| vLLM ServingRuntime for KServe | REST | None | No | Yes | Raw and serverless |

3.1.6. Inference endpoints

These examples show how to use inference endpoints to query the model.

Caikit TGIS ServingRuntime for KServe
  • :443/api/v1/task/text-generation
  • :443/api/v1/task/server-streaming-text-generation
Caikit Standalone ServingRuntime for KServe

If you are serving multiple models, you can query the /info/models REST endpoint or the :443 caikit.runtime.info.InfoService/GetModelsInfo gRPC endpoint to view a list of served models.
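
For example, a REST query for the list of served models might look like the following; the endpoint URL and token are placeholders:

curl -ks <inference_endpoint_url>/info/models -H 'Authorization: Bearer <token>'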

REST endpoints

  • /api/v1/task/embedding
  • /api/v1/task/embedding-tasks
  • /api/v1/task/sentence-similarity
  • /api/v1/task/sentence-similarity-tasks
  • /api/v1/task/rerank
  • /api/v1/task/rerank-tasks
  • /info/models
  • /info/version
  • /info/runtime

gRPC endpoints

  • :443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict
  • :443 caikit.runtime.Nlp.NlpService/EmbeddingTasksPredict
  • :443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTaskPredict
  • :443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTasksPredict
  • :443 caikit.runtime.Nlp.NlpService/RerankTaskPredict
  • :443 caikit.runtime.Nlp.NlpService/RerankTasksPredict
  • :443 caikit.runtime.info.InfoService/GetModelsInfo
  • :443 caikit.runtime.info.InfoService/GetRuntimeInfo
Note

By default, the Caikit Standalone Runtime exposes REST endpoints. To use gRPC protocol, manually deploy a custom Caikit Standalone ServingRuntime. For more information, see Adding a custom model-serving runtime for the single-model serving platform.

An example manifest is available in the caikit-tgis-serving GitHub repository.

TGIS Standalone ServingRuntime for KServe
  • :443 fmaas.GenerationService/Generate
  • :443 fmaas.GenerationService/GenerateStream

    Note

    To query the endpoint for the TGIS standalone runtime, you must also download the files in the proto directory of the OpenShift AI text-generation-inference repository.

OpenVINO Model Server
  • /v2/models/<model-name>/infer
vLLM ServingRuntime for KServe
  • :443/version
  • :443/docs
  • :443/v1/models
  • :443/v1/chat/completions
  • :443/v1/completions
  • :443/v1/embeddings
  • :443/tokenize
  • :443/detokenize

    Note
    • The vLLM runtime is compatible with the OpenAI REST API. For a list of models that the vLLM runtime supports, see Supported models.
    • To use the embeddings inference endpoint in vLLM, you must use an embeddings model that vLLM supports. You cannot use the embeddings endpoint with generative models. For more information, see Supported embeddings models in vLLM.
    • As of vLLM v0.5.5, you must provide a chat template when querying a model by using the /v1/chat/completions endpoint. If your model does not include a predefined chat template, you can use the --chat-template command-line parameter to specify a chat template in your custom vLLM runtime, as shown in the example. Replace <CHAT_TEMPLATE> with the path to your template.

      containers:
        - args:
            - --chat-template=<CHAT_TEMPLATE>

      You can use the chat templates that are available as .jinja files, or the chat templates that are included with the vLLM image under /apps/data/template. For more information, see Chat templates.

    As indicated by the paths shown, the single-model serving platform uses the HTTPS port of your OpenShift router (usually port 443) to serve external API requests.

3.1.6.1. Example commands

Note

If you enabled token authorization when deploying the model, add the Authorization header and specify a token value.

Caikit TGIS ServingRuntime for KServe

curl --json '{"model_id": "<model_name>", "inputs": "<text>"}' https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation -H 'Authorization: Bearer <token>'

Caikit Standalone ServingRuntime for KServe

REST

curl -H 'Content-Type: application/json' -d '{"inputs": "<text>", "model_id": "<model_id>"}' <inference_endpoint_url>/api/v1/task/embedding -H 'Authorization: Bearer <token>'

gRPC

grpcurl -insecure -d '{"text": "<text>"}' -H "mm-model-id: <model_id>" -H 'Authorization: Bearer <token>' <inference_endpoint_url>:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict

TGIS Standalone ServingRuntime for KServe

grpcurl -proto text-generation-inference/proto/generation.proto -d '{"requests": [{"text":"<text>"}]}' -H 'Authorization: Bearer <token>' -insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate

OpenVINO Model Server

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

vLLM ServingRuntime for KServe

curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "<model_name>", "messages": [{"role": "<role>", "content": "<content>"}]}' -H 'Authorization: Bearer <token>'

3.2. About KServe deployment modes

By default, you can deploy models on the single-model serving platform with KServe by using Red Hat OpenShift Serverless, which is a cloud-native development model that allows for serverless deployments of models. OpenShift Serverless is based on the open source Knative project. In addition, serverless mode is dependent on the Red Hat OpenShift Serverless Operator.

Alternatively, you can use raw deployment mode, which is not dependent on the Red Hat OpenShift Serverless Operator. With raw deployment mode, you can deploy models with Kubernetes resources, such as Deployment, Service, Ingress, and Horizontal Pod Autoscaler.
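
In KServe, the deployment mode is selected for each InferenceService through an annotation. The following is a minimal sketch; the resource name is a placeholder, and without the annotation the platform default (serverless) deployment mode applies.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: <model_name>
  annotations:
    # Deploys this model with Kubernetes resources instead of Knative (serverless) resources
    serving.kserve.io/deploymentMode: RawDeployment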

Important

Deploying a machine learning model using KServe raw deployment mode is a Limited Availability feature. Limited Availability means that you can install and receive support for the feature only with specific approval from the Red Hat AI Business Unit. Without such approval, the feature is unsupported. In addition, this feature is only supported on Self-Managed deployments of single node OpenShift.

There are both advantages and disadvantages to using each of these deployment modes:

3.2.1. Serverless mode

Advantages:

  • Enables autoscaling based on request volume:

    • Resources scale up automatically when receiving incoming requests.
    • Optimizes resource usage and maintains performance during peak times.
  • Supports scaling to and from zero using Knative (see the sketch at the end of this section):

    • Allows resources to scale down completely when there are no incoming requests.
    • Saves costs by not running idle resources.

Disadvantages:

  • Has customization limitations:

    • Customization is limited to what Knative supports; for example, you cannot mount multiple volumes.
  • Dependency on Knative for scaling:

    • Introduces additional complexity in setup and management compared to traditional scaling methods.
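
The following is a minimal sketch of the scale-to-zero behavior referenced above. In serverless mode, setting the minimum replica count of the predictor to zero allows Knative to remove idle model server pods; the resource name is a placeholder, and the exact scaling behavior depends on your Knative configuration.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: <model_name>
spec:
  predictor:
    # With serverless mode, minReplicas: 0 lets Knative scale the model server
    # down to zero when there are no incoming requests
    minReplicas: 0
    maxReplicas: 2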

3.2.2. Raw deployment mode

Advantages:

  • Enables deployment with Kubernetes resources, such as Deployment, Service, Ingress, and Horizontal Pod Autoscaler (see the sketch at the end of this section):

    • Provides full control over Kubernetes resources, allowing for detailed customization and configuration of deployment settings.
  • Avoids Knative limitations, such as the inability to mount multiple volumes:

    • Beneficial for applications requiring complex configurations or multiple storage mounts.

Disadvantages:

  • Does not support automatic scaling:

    • Does not support automatic scaling down to zero resources when idle.
    • Might result in higher costs during periods of low traffic.
  • Requires manual management of scaling.
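
The following is a minimal sketch of the manual scaling that raw deployment mode requires: a standard HorizontalPodAutoscaler that targets the predictor Deployment created for the model. The Deployment name shown is a hypothetical placeholder based on the usual <model_name>-predictor pattern; check the actual Deployment name in your project before applying a similar resource.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: <model_name>-predictor-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <model_name>-predictor   # hypothetical; use the Deployment created for your model
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80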

3.3. Deploying models by using the single-model serving platform

On the single-model serving platform, each model is deployed on its own model server. This helps you to deploy, monitor, scale, and maintain large models that require increased resources.

Important

If you want to use the single-model serving platform to deploy a model from S3-compatible storage that uses a self-signed SSL certificate, you must install a certificate authority (CA) bundle on your OpenShift cluster. For more information, see Working with certificates.

3.3.1. Enabling the single-model serving platform

When you have installed KServe, you can use the Red Hat OpenShift AI dashboard to enable the single-model serving platform. You can also use the dashboard to enable model-serving runtimes for the platform.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • If you are using specialized OpenShift AI groups, you are part of the admin group (for example, rhoai-admins) in OpenShift.
  • You have installed KServe.
  • Your cluster administrator has not edited the OpenShift AI dashboard configuration to disable the ability to select the single-model serving platform, which uses the KServe component. For more information, see Dashboard configuration options.

Procedure

  1. Enable the single-model serving platform as follows:

    1. In the left menu, click Settings > Cluster settings.
    2. Locate the Model serving platforms section.
    3. To enable the single-model serving platform for projects, select the Single-model serving platform checkbox.
    4. Click Save changes.
  2. Enable preinstalled runtimes for the single-model serving platform as follows:

    1. In the left menu of the OpenShift AI dashboard, click Settings > Serving runtimes.

      The Serving runtimes page shows preinstalled runtimes and any custom runtimes that you have added.

      For more information about preinstalled runtimes, see Supported runtimes.

    2. Set the runtime that you want to use to Enabled.

      The single-model serving platform is now available for model deployments.

3.3.2. Adding a custom model-serving runtime for the single-model serving platform

A model-serving runtime adds support for a specified set of model frameworks and the model formats supported by those frameworks. You can use the pre-installed runtimes that are included with OpenShift AI. You can also add your own custom runtimes if the default runtimes do not meet your needs. For example, if the TGIS runtime does not support a model format that is supported by Hugging Face Text Generation Inference (TGI), you can create a custom runtime to add support for the model.

As an administrator, you can use the OpenShift AI interface to add and enable a custom model-serving runtime. You can then choose the custom runtime when you deploy a model on the single-model serving platform.

Note

Red Hat does not provide support for custom runtimes. You are responsible for ensuring that you are licensed to use any custom runtimes that you add, and for correctly configuring and maintaining them.
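
For orientation, a custom runtime is defined by a ServingRuntime resource that you upload or edit in the steps that follow. The following is a minimal sketch only; the runtime name, container image, arguments, model format, and environment variables are hypothetical placeholders for values that your own runtime requires.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: <custom_runtime_name>
spec:
  supportedModelFormats:
    - name: <model_format>    # the model format that your runtime can load
      autoSelect: true
  multiModel: false           # runtimes on the single-model serving platform serve one model each
  containers:
    - name: kserve-container
      image: <your_registry>/<your_runtime_image>:<tag>
      args:
        - --port=8080
      env:
        - name: <CUSTOM_PARAMETER>    # custom parameters are typically added to the env section
          value: "<value>"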

Prerequisites

  • You have logged in to OpenShift AI as an administrator.
  • You have built your custom runtime and added the image to a container image repository such as Quay.

Procedure

  1. From the OpenShift AI dashboard, click Settings > Serving runtimes.

    The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.

  2. To add a custom runtime, choose one of the following options:

    • To start with an existing runtime (for example, TGIS Standalone ServingRuntime for KServe), click the action menu (⋮) next to the existing runtime and then click Duplicate.
    • To add a new custom runtime, click Add serving runtime.
  3. In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
  4. In the Select the API protocol this runtime supports list, select REST or gRPC.
  5. Optional: If you started a new runtime (rather than duplicating an existing one), add your code by choosing one of the following options:

    • Upload a YAML file

      1. Click Upload files.
      2. In the file browser, select a YAML file on your computer.

        The embedded YAML editor opens and shows the contents of the file that you uploaded.

    • Enter YAML code directly in the editor

      1. Click Start from scratch.
      2. Enter or paste YAML code directly in the embedded editor.
    Note

    In many cases, creating a custom runtime will require adding new or custom parameters to the env section of the ServingRuntime specification.

  6. Click Add.

    The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the custom runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.

  7. Optional: To edit your custom runtime, click the action menu (⋮) and select Edit.

Verification

  • The custom model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.

3.3.3. Deploying models on the single-model serving platform

When you have enabled the single-model serving platform, you can enable a pre-installed or custom model-serving runtime and start to deploy models on the platform.

Note

Text Generation Inference Server (TGIS) is based on an early fork of Hugging Face TGI. Red Hat will continue to develop the standalone TGIS runtime to support TGI models. If a model does not work in the current version of OpenShift AI, support might be added in a future version. In the meantime, you can also add your own, custom runtime to support a TGI model. For more information, see Adding a custom model-serving runtime for the single-model serving platform.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • If you are using specialized OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
  • You have installed KServe.
  • You have enabled the single-model serving platform.
  • You have created a data science project.
  • You have access to S3-compatible object storage.
  • For the model that you want to deploy, you know the associated folder path in your S3-compatible object storage bucket.
  • To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see Converting Hugging Face Hub models to Caikit format in the caikit-tgis-serving repository.
  • If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in OpenShift AI. See Enabling NVIDIA GPUs.
  • To use the vLLM runtime, you have enabled GPU support in OpenShift AI and have installed and configured the Node Feature Discovery operator on your cluster. For more information, see Installing the Node Feature Discovery operator and Enabling NVIDIA GPUs.
Note

In OpenShift AI, Red Hat supports only NVIDIA GPU accelerators for model serving.

  • To deploy RHEL AI models:

    • You have enabled the vLLM runtime.
    • You have downloaded the model from the Red Hat container registry and uploaded it to S3-compatible object storage.

Procedure

  1. In the left menu, click Data Science Projects.

    The Data Science Projects page opens.

  2. Click the name of the project that you want to deploy a model in.

    A project details page opens.

  3. Click the Models tab.
  4. Perform one of the following actions:

    • If you see a Single-model serving platform tile, click Deploy model on the tile.
    • If you do not see any tiles, click the Deploy model button.

    The Deploy model dialog opens.

  5. In the Model name field, enter a unique name for the model that you are deploying.
  6. In the Serving runtime field, select an enabled runtime.
  7. From the Model framework list, select a value.
  8. In the Number of model replicas to deploy field, specify a value.
  9. From the Model server size list, select a value.
  10. Optional: In the Model route section, select the Make deployed models available through an external route checkbox to make your deployed models available to external clients.
  11. To require token authorization for inference requests to the deployed model, perform the following actions:

    1. Select Require token authorization.
    2. In the Service account name field, enter the service account name that the token will be generated for.
  12. To specify the location of your model, perform one of the following sets of actions:

    • To use an existing data connection

      1. Select Existing data connection.
      2. From the Name list, select a data connection that you previously defined.
      3. In the Path field, enter the folder path that contains the model in your specified data source.

        Important

        The OpenVINO Model Server runtime has specific requirements for how you specify the model path. For more information, see known issue RHOAIENG-3025 in the OpenShift AI release notes.

    • To use a new data connection

      1. To define a new data connection that your model can access, select New data connection.
      2. In the Name field, enter a unique name for the data connection.
      3. In the Access key field, enter the access key ID for your S3-compatible object storage provider.
      4. In the Secret key field, enter the secret access key for the S3-compatible object storage account that you specified.
      5. In the Endpoint field, enter the endpoint of your S3-compatible object storage bucket.
      6. In the Region field, enter the default region of your S3-compatible object storage account.
      7. In the Bucket field, enter the name of your S3-compatible object storage bucket.
      8. In the Path field, enter the folder path in your S3-compatible object storage that contains your data file.

        Important

        The OpenVINO Model Server runtime has specific requirements for how you specify the model path. For more information, see known issue RHOAIENG-3025 in the OpenShift AI release notes.

  13. Click Deploy.

Verification

  • Confirm that the deployed model is shown on the Models tab for the project, and on the Model Serving page of the dashboard with a checkmark in the Status column.
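
For reference, deploying a model from the dashboard creates a KServe InferenceService resource in your project. The following is a hedged sketch of what such a resource can look like; the names, model format, runtime, and storage URI are placeholders, and the exact fields that the dashboard sets depend on your OpenShift AI version.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: <model_name>
  namespace: <project_name>
spec:
  predictor:
    model:
      modelFormat:
        name: <model_format>           # the model framework selected in the dashboard
      runtime: <serving_runtime_name>  # the enabled runtime selected in the dashboard
      # Location of the model files in S3-compatible object storage
      storageUri: s3://<bucket_name>/<folder_path>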

3.4. Making inference requests to models deployed on the single-model serving platform

When you deploy a model by using the single-model serving platform, the model is available as a service that you can access using API requests. This enables you to return predictions based on data inputs. To use API requests to interact with your deployed model, you must know the inference endpoint for the model.

In addition, if you secured your inference endpoint by enabling token authorization, you must know how to access your authorization token so that you can specify this in your inference requests.

3.4.1. Accessing the authorization token for a deployed model

If you secured your model inference endpoint by enabling token authorization, you must know how to access your authorization token so that you can specify it in your inference requests.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • If you are using specialized OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
  • You have deployed a model by using the single-model serving platform.

Procedure

  1. From the OpenShift AI dashboard, click Data Science Projects.

    The Data Science Projects page opens.

  2. Click the name of the project that contains your deployed model.

    A project details page opens.

  3. Click the Models tab.
  4. In the Models and model servers list, expand the section for your model.

    Your authorization token is shown in the Token authorization section, in the Token secret field.

  5. Optional: To copy the authorization token for use in an inference request, click the Copy button next to the token value.

3.4.2. Accessing the inference endpoint for a deployed model

To make inference requests to your deployed model, you must know how to access the inference endpoint that is available.

For a list of paths to use with the supported runtimes and example commands, see Inference endpoints.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • If you are using specialized OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
  • You have deployed a model by using the single-model serving platform.
  • If you enabled token authorization for your deployed model, you have the associated token value.

Procedure

  1. From the OpenShift AI dashboard, click Model Serving.

    The inference endpoint for the model is shown in the Inference endpoint field.

  2. Depending on what action you want to perform with the model (and if the model supports that action), copy the inference endpoint and then add a path to the end of the URL.
  3. Use the endpoint to make API requests to your deployed model.
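
For example, a request that combines the copied inference endpoint with one of the paths listed in Inference endpoints might look like the following; the URL, path, request body, and token are placeholders:

curl -ks <inference_endpoint_url><path> -H 'Authorization: Bearer <token>' -d '<request_body>'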

3.5. Viewing model-serving runtime metrics for the single-model serving platform

When a cluster administrator has configured monitoring for the single-model serving platform, non-admin users can use the OpenShift web console to view model-serving runtime metrics for the KServe component.

Prerequisites

Procedure

  1. Log in to the OpenShift web console.
  2. Switch to the Developer perspective.
  3. In the left menu, click Observe.
  4. As described in Querying metrics for user-defined projects as a developer (Red Hat OpenShift Dedicated) or Querying metrics for user-defined projects as a developer (Red Hat OpenShift Service on AWS), use the web console to run queries for caikit_*, tgi_*, ovms_* and vllm:* model-serving runtime metrics. You can also run queries for istio_* metrics that are related to OpenShift Service Mesh. Some examples are shown.

    1. The following query displays the number of successful inference requests over a period of time for a model deployed with the vLLM runtime:

      sum(increase(vllm:request_success_total{namespace=${namespace},model_name=${model_name}}[${rate_interval}]))
    2. The following query displays the number of successful inference requests over a period of time for a model deployed with the standalone TGIS runtime:

      sum(increase(tgi_request_success{namespace=${namespace}, pod=~${model_name}-predictor-.*}[${rate_interval}]))
    3. The following query displays the number of successful inference requests over a period of time for a model deployed with the Caikit Standalone runtime:

      sum(increase(predict_rpc_count_total{namespace=${namespace},code=OK,model_id=${model_name}}[${rate_interval}]))
    4. The following query displays the number of successful inference requests over a period of time for a model deployed with the OpenVINO Model Server runtime:

      sum(increase(ovms_requests_success{namespace=${namespace},name=${model_name}}[${rate_interval}]))

3.6. Monitoring model performance

In the single-model serving platform, you can view performance metrics for a specific model that is deployed on the platform.

3.6.1. Viewing performance metrics for a deployed model

You can monitor the following metrics for a specific model that is deployed on the single-model serving platform:

  • Number of requests - The number of requests that have failed or succeeded for a specific model.
  • Average response time (ms) - The average time it takes a specific model to respond to requests.
  • CPU utilization (%) - The percentage of the CPU limit per model replica that is currently utilized by a specific model.
  • Memory utilization (%) - The percentage of the memory limit per model replica that is utilized by a specific model.

You can specify a time range and a refresh interval for these metrics to help you determine, for example, when the peak usage hours are and how the model is performing at a specified time.

Prerequisites

  • You have installed Red Hat OpenShift AI.
  • You have logged in to OpenShift AI.
  • If you are using specialized OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
  • The following dashboard configuration options are set to the default values as shown:

    disablePerformanceMetrics: false
    disableKServeMetrics: false

    For more information, see Dashboard configuration options.

  • You have deployed a model on the single-model serving platform by using a preinstalled runtime.

    Note

    Metrics are only supported for models deployed by using a preinstalled model-serving runtime or a custom runtime that is duplicated from a preinstalled runtime.

Procedure

  1. From the OpenShift AI dashboard navigation menu, click Data Science Projects.

    The Data Science Projects page opens.

  2. Click the name of the project that contains the data science models that you want to monitor.
  3. In the project details page, click the Models tab.
  4. Select the model that you are interested in.
  5. On the Endpoint performance tab, set the following options:

    • Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
    • Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.
  6. Scroll down to view data graphs for number of requests, average response time, CPU utilization, and memory utilization.

Verification

The Endpoint performance tab shows graphs of metrics for the model.

3.7. Optimizing model-serving runtimes

You can optionally enhance the preinstalled model-serving runtimes available in OpenShift AI to leverage additional benefits and capabilities, such as optimized inferencing, reduced latency, and fine-tuned resource allocation.

3.7.1. Optimizing the vLLM model-serving runtime

You can configure the vLLM ServingRuntime for KServe runtime to use speculative decoding, a parallel processing technique to optimize inferencing time for large language models (LLMs).

You can also configure the runtime to support inferencing for vision-language models (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.

To configure the vLLM ServingRuntime for KServe runtime for speculative decoding or multi-modal inferencing, you must add additional arguments in the vLLM model-serving runtime.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • If you are using specialized OpenShift AI groups, you are part of the admin group (for example, rhoai-admins) in OpenShift.
  • If you used the pre-installed vLLM ServingRuntime for KServe runtime, you duplicated the runtime to create a custom version. For more information about duplicating the pre-installed vLLM runtime, see Adding a custom model-serving runtime for the single-model serving platform.
  • If you are using the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.

Procedure

  1. From the OpenShift AI dashboard, click Settings > Serving runtimes.

    The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.

  2. Find the custom vLLM model-serving runtime that you created, click the action menu (⋮) next to the runtime and select Edit.

    The embedded YAML editor opens and shows the contents of the custom model-serving runtime.

  3. To configure the vLLM model-serving runtime for speculative decoding by matching n-grams in the prompt:

    1. Add the following arguments:

      containers:
        - args:
            - --speculative-model=[ngram]
            - --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
            - --ngram-prompt-lookup-max=<NGRAM_PROMPT_LOOKUP_MAX>
            - --use-v2-block-manager
    2. Replace <NUM_SPECULATIVE_TOKENS> and <NGRAM_PROMPT_LOOKUP_MAX> with your own values.

      Note

      Inferencing throughput varies depending on the model used for speculating with n-grams.

  4. To configure the vLLM model-serving runtime for speculative decoding with a draft model:

    1. Remove the --model argument:

      containers:
        - args:
            - --model=/mnt/models
    2. Add the following arguments:

      containers:
        - args:
            - --port=8080
            - --served-model-name={{.Name}}
            - --distributed-executor-backend=mp
            - --model=/mnt/models/<path_to_original_model>
            - --speculative-model=/mnt/models/<path_to_speculative_model>
            - --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
            - --use-v2-block-manager
    3. Replace <path_to_speculative_model> and <path_to_original_model> with the paths to the speculative model and original model on your S3-compatible object storage.
    4. Replace <NUM_SPECULATIVE_TOKENS> with your own value.
  5. To configure the vLLM model-serving runtime for multi-modal inferencing:

    1. Add the following arguments:

      containers:
        - args:
            - --trust-remote-code
      Note

      Only use the --trust-remote-code argument with models from trusted sources.

  6. Click Update.

    The Serving runtimes page opens and shows the list of runtimes that are installed. Confirm that the custom model-serving runtime you updated is shown.

  7. Deploy the model by using the custom runtime as described in Deploying models on the single-model serving platform.

Verification

  • If you have configured the vLLM model-serving runtime for speculative decoding, use the following example command to verify API requests to your deployed model:

    curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <token>" \
    -d '{"model": "<model_name>", "messages": [{"role": "<role>", "content": "<content>"}]}'
  • If you have configured the vLLM model-serving runtime for multi-modal inferencing, use the following example command to verify API requests to the vision-language model (VLM) that you have deployed:

    curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <token>" \
    -d '{"model":"<model_name>",
         "messages":
            [{"role":"<role>",
              "content":
                 [{"type":"text", "text":"<text>"
                  },
                  {"type":"image_url", "image_url":"<image_url_link>"
                  }
                 ]
             }
            ]
        }'

3.8. Performance tuning on the single-model serving platform

Certain performance issues might require you to tune the parameters of your inference service or model-serving runtime.

3.8.1. Resolving CUDA out-of-memory errors

In certain cases, depending on the model and hardware accelerator used, the TGIS memory auto-tuning algorithm might underestimate the amount of GPU memory needed to process long sequences. This miscalculation can lead to Compute Unified Architecture (CUDA) out-of-memory (OOM) error responses from the model server. In such cases, you must update or add additional parameters in the TGIS model-serving runtime, as described in the following procedure.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • If you are using specialized OpenShift AI groups, you are part of the admin group (for example, rhoai-admins) in OpenShift.

Procedure

  1. From the OpenShift AI dashboard, click Settings > Serving runtimes.

    The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.

  2. Based on the runtime that you used to deploy your model, perform one of the following actions:

    • If you used the pre-installed TGIS Standalone ServingRuntime for KServe runtime, duplicate the runtime to create a custom version and then follow the remainder of this procedure. For more information about duplicating the pre-installed TGIS runtime, see Adding a custom model-serving runtime for the single-model serving platform.
    • If you were already using a custom TGIS runtime, click the action menu (⋮) next to the runtime and select Edit.

      The embedded YAML editor opens and shows the contents of the custom model-serving runtime.

  3. Add or update the BATCH_SAFETY_MARGIN environment variable and set the value to 30. Similarly, add or update the ESTIMATE_MEMORY_BATCH_SIZE environment variable and set the value to 8.

    spec:
      containers:
        - env:
            - name: BATCH_SAFETY_MARGIN
              value: "30"
            - name: ESTIMATE_MEMORY_BATCH_SIZE
              value: "8"
    Note

    The BATCH_SAFETY_MARGIN parameter sets a percentage of free GPU memory to hold back as a safety margin to avoid OOM conditions. The default value of BATCH_SAFETY_MARGIN is 20. The ESTIMATE_MEMORY_BATCH_SIZE parameter sets the batch size used in the memory auto-tuning algorithm. The default value of ESTIMATE_MEMORY_BATCH_SIZE is 16.

  4. Click Update.

    The Serving runtimes page opens and shows the list of runtimes that are installed. Observe that the custom model-serving runtime you updated is shown.

  5. To redeploy the model for the parameter updates to take effect, perform the following actions:

    1. From the OpenShift AI dashboard, click Model Serving > Deployed Models.
    2. Find the model you want to redeploy, click the action menu (⋮) next to the model, and select Delete.
    3. Redeploy the model as described in Deploying models on the single-model serving platform.

Verification

  • You receive successful responses from the model server and no longer see CUDA OOM errors.