Chapter 2. Managing and monitoring models on the single-model serving platform
As a cluster administrator, you can manage and monitor models on the single-model serving platform. You can configure monitoring for the single-model serving platform, deploy models across multiple GPU nodes, and set up a Grafana dashboard to visualize real-time metrics, among other tasks.
2.1. Setting a timeout for KServe
When deploying large models or using node autoscaling with KServe, the operation may time out before a model is deployed because the default progress-deadline that Knative Serving sets is 10 minutes.
If a pod using Knative Serving takes longer than 10 minutes to deploy, the pod might be automatically marked as failed. This can happen if you are deploying large models that take longer than 10 minutes to pull from S3-compatible object storage or if you are using node autoscaling to reduce the consumption of GPU nodes.
To resolve this issue, you can set a custom progress-deadline in the KServe InferenceService for your application.
Prerequisites
- You have namespace edit access for your OpenShift cluster.
Procedure
- Log in to the OpenShift console as a cluster administrator.
- Select the project where you have deployed the model.
- In the Administrator perspective, click Home > Search.
- From the Resources dropdown menu, search for `InferenceService`.
- Under `spec.predictor.annotations`, modify the `serving.knative.dev/progress-deadline` annotation with the new timeout.

Note: Ensure that you set the `progress-deadline` at the `spec.predictor.annotations` level, so that the KServe `InferenceService` can copy the `progress-deadline` back to the Knative Service object.
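For example, the annotation placement can be sketched as follows; the service name and the 30-minute deadline are illustrative values:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-inference-service
spec:
  predictor:
    annotations:
      # Custom timeout; replace 30m with a value long enough
      # to pull and load your model.
      serving.knative.dev/progress-deadline: 30m
```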
2.2. Deploying models by using multiple GPU nodes
Deploy models across multiple GPU nodes to handle large models, such as large language models (LLMs).
You can serve models on Red Hat OpenShift AI across multiple GPU nodes using the vLLM serving framework. Multi-node inferencing uses the vllm-multinode-runtime custom runtime, which uses the same image as the vLLM NVIDIA GPU ServingRuntime for KServe runtime and also includes information necessary for multi-GPU inferencing.
You can deploy the model from a persistent volume claim (PVC) or from an Open Container Initiative (OCI) container image.
Deploying models by using multiple GPU nodes is currently available in Red Hat OpenShift AI as a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
Prerequisites
- You have cluster administrator privileges for your OpenShift cluster.
- You have installed the OpenShift CLI (`oc`) as described in the appropriate documentation for your cluster:
  - Installing the OpenShift CLI for OpenShift Dedicated
- Installing the OpenShift CLI for Red Hat OpenShift Service on AWS (classic architecture)
- You have enabled the operators for your GPU type, such as the Node Feature Discovery Operator and the NVIDIA GPU Operator. For more information about enabling accelerators, see Enabling accelerators.
- You are using an NVIDIA GPU (`nvidia.com/gpu`).
- You have specified the GPU type through either the `ServingRuntime` or `InferenceService`. If the GPU type specified in the `ServingRuntime` differs from what is set in the `InferenceService`, both GPU types are assigned to the resource and can cause errors.
- You have enabled KServe on your cluster.
- You have only one head pod in your setup. Do not adjust the replica count by using the `min_replicas` or `max_replicas` settings in the `InferenceService`. Creating additional head pods can cause them to be excluded from the Ray cluster.
- To deploy from a PVC: You have a persistent volume claim (PVC) set up and configured for ReadWriteMany (RWX) access mode.
- To deploy from an OCI container image:
- You have stored a model in an OCI container image.
- If the model is stored in a private OCI repository, you have configured an image pull secret.
Procedure
- In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:

  ```
  $ oc login <openshift_cluster_url> -u <admin_username> -p <password>
  ```

- Select or create a namespace for deploying the model. For example, run the following command to create the `kserve-demo` namespace:

  ```
  $ oc new-project kserve-demo
  ```

- (Deploying a model from a PVC only) Create a PVC for model storage in the namespace where you want to deploy the model. Create a storage class with `Filesystem volumeMode` and use this storage class for your PVC. The storage size must be larger than the size of the model files on disk. For example:

  Note: If you have already configured a PVC or are deploying a model from an OCI container image, you can skip this step.
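A minimal PVC sketch follows; the claim name, storage class name, and storage size are placeholders to adapt to your model:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: granite-8b-code-base-pvc   # example claim name
spec:
  accessModes:
    - ReadWriteMany                # RWX is required for multi-node deployments
  volumeMode: Filesystem
  resources:
    requests:
      storage: 50Gi                # must exceed the on-disk size of the model files
  storageClassName: <fs_storage_class>
```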
- Create a pod to download the model to the PVC you created. Update the sample YAML with your bucket name, model path, and credentials:
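The download pod can be sketched as follows. This is a hypothetical example: the pod and claim names, the busybox image, and the storage-initializer entrypoint path are assumptions to adjust for your environment; the numbered markers correspond to the callout descriptions that follow:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: download-granite-8b-code
spec:
  restartPolicy: Never
  volumes:
    - name: model-volume
      persistentVolumeClaim:
        claimName: granite-8b-code-base-pvc
  initContainers:
    - name: fix-volume-permissions
      image: quay.io/prometheus/busybox:latest
      command: ["sh"]
      # <1> chmod requires a root pod; remove it otherwise.
      args: ["-c", "mkdir -p /mnt/models/$(MODEL_PATH) && chmod -R 777 /mnt/models"]
      env:
        - name: MODEL_PATH
          value: <model_path>                  # <2>
      volumeMounts:
        - mountPath: /mnt/models/
          name: model-volume
  containers:
    - name: download-model
      image: <kserve-storage-initializer-image>  # <3>
      args:
        - "s3://$(BUCKET_NAME)/$(MODEL_PATH)/"
        - "/mnt/models/$(MODEL_PATH)"
      env:
        - name: AWS_ACCESS_KEY_ID
          value: <id>                          # <4>
        - name: AWS_SECRET_ACCESS_KEY
          value: <secret>                      # <5>
        - name: BUCKET_NAME
          value: <bucket_name>                 # <6>
        - name: MODEL_PATH
          value: <model_path>                  # <7>
        - name: AWS_ENDPOINT_URL
          value: <endpoint>                    # <8>
        - name: AWS_DEFAULT_REGION
          value: <region>                      # <9>
        - name: S3_VERIFY_SSL
          value: "true"                        # <10>
      volumeMounts:
        - mountPath: /mnt/models/
          name: model-volume
```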
- 1: The `chmod` operation is permitted only if your pod is running as root. Remove `chmod -R 777` from the arguments if you are not running the pod as root.
- 2, 7: Specify the path to the model.
- 3: The value for `containers.image`, located in your `InferenceService`. To access this value, run the following command: `oc get configmap inferenceservice-config -n redhat-ods-operator -oyaml | grep kserve-storage-initializer:`
- 4: The access key ID to your S3 bucket.
- 5: The secret access key to your S3 bucket.
- 6: The name of your S3 bucket.
- 8: The endpoint to your S3 bucket.
- 9: The region for your S3 bucket if you are using an AWS S3 bucket. If you are using other S3-compatible storage, such as ODF or Minio, you can remove the `AWS_DEFAULT_REGION` environment variable.
- 10: If you encounter SSL errors, change `S3_VERIFY_SSL` to `false`.
- Create the `vllm-multinode-runtime` custom runtime in your project namespace:

  ```
  $ oc process vllm-multinode-runtime-template -n redhat-ods-applications | oc apply -n kserve-demo -f -
  ```

- Deploy the model by using the following `InferenceService` configuration:
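A sketch of such a configuration follows; the service name and model path are illustrative, and the numbered markers correspond to the callout descriptions below:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: granite-8b-code-base-pvc
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-multinode-runtime
      storageUri: pvc://granite-8b-code-base-pvc/<model_path>  # <1>
    workerSpec: {}                                             # <2>
```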
- 1: Specify the path to your model based on your deployment method:
  - For PVC: `pvc://<pvc_name>/<model_path>`
  - For an OCI container image: `oci://<registry_host>/<org_or_username>/<repository_name>:<tag_or_digest>`
- 2: The following configuration can be added to the `InferenceService`:
  - `workerSpec.tensorParallelSize`: Determines how many GPUs are used per node. The GPU type count in both the head and worker node deployment resources is updated automatically. Ensure that the value of `workerSpec.tensorParallelSize` is at least 1.
  - `workerSpec.pipelineParallelSize`: Determines how many nodes are used to balance the model in deployment. This variable represents the total number of nodes, including both the head and worker nodes. Ensure that the value of `workerSpec.pipelineParallelSize` is at least 2. Do not modify this value in production environments.

  Note: You may need to specify additional arguments, depending on your environment and model size.
- Deploy the model by applying the `InferenceService` configuration:

  ```
  $ oc apply -f <inference-service-file.yaml>
  ```
Verification
To confirm that you have set up your environment to deploy models on multiple GPU nodes, check the GPU resource status, the InferenceService status, the Ray cluster status, and send a request to the model.
- Check the GPU resource status:
  - Retrieve the pod names for the head and worker nodes.
  - Confirm that the model loaded properly by checking the GPU memory usage reported for the head and worker pods. If the model did not load, the value of these fields is `0MiB`.
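The pod-name retrieval and GPU memory check can be sketched as follows. The service name continues the `granite-8b-code-base-pvc` example, and the label selectors are assumptions based on KServe's default predictor labels; adjust them to match your deployment:

```
# Retrieve the head and worker pod names (label selectors are assumptions).
podName=$(oc get pod -l app=isvc.granite-8b-code-base-pvc-predictor --no-headers | cut -d' ' -f1)
workerPodName=$(oc get pod -l app=isvc.granite-8b-code-base-pvc-predictor-worker --no-headers | cut -d' ' -f1)

# Check GPU memory usage on the head and worker nodes.
echo "### HEAD NODE GPU memory"
oc exec $podName -- nvidia-smi
echo "### WORKER NODE GPU memory"
oc exec $workerPodName -- nvidia-smi
```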
- Verify the status of your `InferenceService` by using the following commands:

  NOTE: In the Technology Preview, you can only use port forwarding for inferencing.

  ```
  $ oc wait --for=condition=ready pod/${podName} -n $DEMO_NAMESPACE --timeout=300s
  $ export MODEL_NAME=granite-8b-code-base-pvc
  ```

  Sample response:

  ```
  NAME                       URL                                                   READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
  granite-8b-code-base-pvc   http://granite-8b-code-base-pvc.default.example.com
  ```

- Send a request to the model to confirm that the model is available for inference:
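Because the Technology Preview supports only port forwarding for inferencing, a request can be sketched as follows; the port, prompt, and model name are illustrative:

```
# Forward the predictor pod's serving port to localhost.
oc port-forward $podName 8080:8080 &

# Send a completion request to the vLLM OpenAI-compatible endpoint.
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "granite-8b-code-base-pvc",
        "prompt": "What is machine learning?",
        "max_tokens": 100,
        "temperature": 0
      }'
```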
2.3. Configuring an inference service for Kueue
To queue your inference service workloads and manage their resources, add the kueue.x-k8s.io/queue-name label to the service’s metadata. This label directs the workload to a specific LocalQueue for management and is required only if your project is enabled for Kueue. For more information, see Managing workloads with Kueue.
Prerequisites
- You have permissions to edit resources in the project where the model is deployed.
- As a cluster administrator, you have installed and activated the Red Hat build of Kueue Operator as described in Configuring workload management with Kueue.
Procedure
To configure the inference service, complete the following steps:
- Log in to the OpenShift console.
- In the Administrator perspective, navigate to your project and locate the `InferenceService` resource for your model.
- Click the name of the `InferenceService` to view its details.
- Select the YAML tab to open the editor.
- In the `metadata` section, add the `kueue.x-k8s.io/queue-name` label under `labels`. Replace `<local-queue-name>` with the name of your target `LocalQueue`.
- Click Save.
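The label placement described in the procedure can be sketched as follows; `<model_name>` and `<local-queue-name>` are placeholders:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: <model_name>
  labels:
    # Directs the workload to a specific LocalQueue for management.
    kueue.x-k8s.io/queue-name: <local-queue-name>
```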
Verification
- The workload is submitted to the `LocalQueue` specified in the `kueue.x-k8s.io/queue-name` label.
- The workload starts when the required cluster resources are available and admitted by the queue.
- Optional: To verify, run the following command and review the `Admitted Workloads` section:

  ```
  $ oc describe localqueue <local-queue-name> -n <project-namespace>
  ```
2.4. Configuring an inference service for Spyre
Support for IBM Spyre AI Accelerators on x86 is currently available in Red Hat OpenShift AI as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
If you are deploying a model using a hardware profile that relies on Spyre schedulers, you must manually edit the InferenceService YAML after deployment to add the required scheduler name and tolerations. This step is necessary because the user interface does not currently provide an option to specify a custom scheduler.
Prerequisites
- You have deployed a model on OpenShift by using the vLLM Spyre AI Accelerator ServingRuntime for KServe runtime.
- You have privileges to edit resources in the project where the model is deployed.
Procedure
To configure the inference service, complete the following steps:
- Log in to the OpenShift console.
- From the perspective dropdown menu, select Administrator.
- From the Project dropdown menu, select the project where your model is deployed.
- Navigate to Home > Search.
- From the Resources dropdown menu, select `InferenceService`.
- Click the name of the `InferenceService` resource associated with your model.
- Select the YAML tab.
- Edit the `spec.predictor` section to add the `schedulerName` and `tolerations` fields, as shown in the following example:
- Click Save.
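A hypothetical sketch of the edited section follows. The scheduler name and toleration key are assumptions: use the scheduler deployed by your Spyre installation and the taint actually applied to your Spyre nodes:

```yaml
spec:
  predictor:
    schedulerName: spyre-scheduler    # assumed scheduler name for your Spyre setup
    tolerations:
      - key: ibm.com/spyre            # assumed taint key on your Spyre nodes
        operator: Exists
        effect: NoSchedule
```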
Verification
After you save the YAML, the existing pod for the model is terminated and a new pod is created.
- Navigate to Workloads > Pods.
- Click the new pod for your model to view its details.
- On the Details page, verify that the pod is running on a Spyre node by checking the Node information.
2.5. Optimizing performance and tuning
You can optimize and tune your deployed models to balance speed, efficiency, and cost for different use cases.
To evaluate a model’s inference performance, consider these key metrics:
- Latency: The time it takes to generate a response, which is critical for real-time applications. This includes Time-to-First-Token (TTFT) and Inter-Token Latency (ITL).
- Throughput: The overall efficiency of the model server, measured in Tokens per Second (TPS) or Requests per Second (RPS).
- Cost per million tokens: The cost-effectiveness of the model’s inference.
Performance is influenced by factors like model size, available GPU memory, and input sequence length, especially for applications like text-summarization and retrieval-augmented generation (RAG). To meet your performance requirements, you can use techniques such as quantization to reduce memory needs or parallelism to distribute very large models across multiple GPUs.
2.5.1. Determining GPU requirements for LLM-powered applications
There are several factors to consider when choosing GPUs for applications powered by a Large Language Model (LLM) hosted on OpenShift AI.
The following guidelines help you determine the hardware requirements for your application, depending on the size and expected usage of your model.
- Estimating memory needs: A general rule of thumb is that a model with N parameters in 16-bit precision requires approximately 2N bytes of GPU memory. For example, an 8-billion-parameter model requires around 16GB of GPU memory, while a 70-billion-parameter model requires around 140GB.
- Quantization: To reduce memory requirements and potentially improve throughput, you can use quantization to load or run the model in lower-precision formats such as INT8, FP8, or INT4. This reduces the memory footprint at the expense of a slight reduction in model accuracy.
Note: The vLLM ServingRuntime for KServe model-serving runtime supports several quantization methods. For more information about supported implementations and compatible hardware, see Supported hardware for quantization kernels.
- Additional memory for key-value cache: In addition to model weights, GPU memory is also needed to store the attention key-value (KV) cache, which increases with the number of requests and the sequence length of each request. This can impact performance in real-time applications, especially for larger models.
Recommended GPU configurations:
- Small Models (1B–8B parameters): For models in this range, a GPU with 24GB of memory is generally sufficient to support a small number of concurrent users.
- Medium Models (10B–34B parameters):
  - Models under 20B parameters require at least 48GB of GPU memory.
  - Models between 20B and 34B parameters require at least 80GB of memory in a single GPU.
- Large Models (70B parameters): Models in this range may need to be distributed across multiple GPUs by using tensor parallelism techniques. Tensor parallelism allows the model to span multiple GPUs, improving inter-token latency and increasing the maximum batch size by freeing up additional memory for KV cache. Tensor parallelism works best when GPUs have fast interconnects such as NVLink.
- Very Large Models (405B parameters): For extremely large models, quantization is recommended to reduce memory demands. You can also distribute the model using pipeline parallelism across multiple GPUs, or even across two servers. This approach allows you to scale beyond the memory limitations of a single server, but requires careful management of inter-server communication for optimal performance.
For best results, start with smaller models and then scale up to larger models as required, using techniques such as parallelism and quantization to meet your performance and memory requirements.
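The 2N-bytes rule of thumb above can be expressed as a small calculation. This is an illustrative sketch, not a sizing tool; it estimates weight memory only and excludes the KV cache and runtime overhead:

```python
def estimate_model_memory_gb(num_params_billions: float,
                             bytes_per_param: float = 2.0) -> float:
    """Estimate GPU memory (GB) needed for model weights alone.

    bytes_per_param: 2.0 for 16-bit precision (FP16/BF16),
    1.0 for INT8/FP8, 0.5 for INT4 quantization.
    Excludes the KV cache, which grows with concurrent requests
    and sequence length.
    """
    return num_params_billions * bytes_per_param

# 8B model in 16-bit precision needs roughly 16 GB of GPU memory.
print(estimate_model_memory_gb(8))        # → 16.0
# 70B model in 16-bit precision needs roughly 140 GB.
print(estimate_model_memory_gb(70))       # → 140.0
# Quantizing the 70B model to INT4 reduces the weight footprint to ~35 GB.
print(estimate_model_memory_gb(70, 0.5))  # → 35.0
```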
2.5.2. Performance considerations for text-summarization and retrieval-augmented generation (RAG) applications
Additional factors apply to text-summarization and RAG applications, as well as to LLM-powered services that process large documents uploaded by users.
- Longer Input Sequences: The input sequence length can be significantly longer than in a typical chat application, if each user query includes a large prompt or a large amount of context such as an uploaded document. The longer input sequence length increases the prefill time, the time the model takes to process the initial input sequence before generating a response, which can then lead to a higher Time-to-First-Token (TTFT). A longer TTFT may impact the responsiveness of the application. Minimize this latency for optimal user experience.
- KV Cache Usage: Longer sequences require more GPU memory for the key-value (KV) cache. The KV cache stores intermediate attention data to improve model performance during generation. A high KV cache utilization per request requires a hardware setup with sufficient GPU memory. This is particularly crucial if multiple users are querying the model concurrently, as each request adds to the total memory load.
- Optimal Hardware Configuration: To maintain responsiveness and avoid memory bottlenecks, select a GPU configuration with sufficient memory. For instance, instead of running an 8B model on a single 24GB GPU, deploying it on a larger GPU (e.g., 48GB or 80GB) or across multiple GPUs can improve performance by providing more memory headroom for the KV cache and reducing inter-token latency. Multi-GPU setups with tensor parallelism can also help manage memory demands and improve efficiency for larger input sequences.
In summary, to ensure optimal responsiveness and scalability for document-based applications, you must prioritize hardware with high GPU memory capacity and also consider multi-GPU configurations to handle the increased memory requirements of long input sequences and KV caching.
2.5.3. Inference performance metrics
Latency, throughput, and cost per million tokens are key metrics to consider when evaluating the response generation efficiency of a model during inferencing. These metrics provide a comprehensive view of a model’s inference performance and can help balance speed, efficiency, and cost for different use cases.
2.5.3.1. Latency
Latency is critical for interactive or real-time use cases, and is measured using the following metrics:
- Time-to-First-Token (TTFT): The delay in milliseconds between the initial request and the generation of the first token. This metric is important for streaming responses.
- Inter-Token Latency (ITL): The time taken in milliseconds to generate each subsequent token after the first, also relevant for streaming.
- Time-Per-Output-Token (TPOT): For non-streaming requests, the average time taken in milliseconds to generate each token in an output sequence.
2.5.3.2. Throughput
Throughput measures the overall efficiency of a model server and is expressed with the following metrics:
- Tokens per Second (TPS): The total number of tokens generated per second across all active requests.
- Requests per Second (RPS): The number of requests processed per second. RPS, like response time, is sensitive to sequence length.
2.5.3.3. Cost per million tokens
Cost per Million Tokens measures the cost-effectiveness of a model’s inference, indicating the expense incurred per million tokens generated. This metric helps to assess both the economic feasibility and scalability of deploying the model.
2.5.4. Configuring metrics-based autoscaling
Metrics-based autoscaling is currently available in Red Hat OpenShift AI as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
Knative-based autoscaling is not available in KServe RawDeployment mode. However, you can enable metrics-based autoscaling for an inference service in this mode. Metrics-based autoscaling helps you efficiently manage accelerator resources, lower operational costs, and ensure that your inference services meet performance requirements.
To set up autoscaling for your inference service in KServe RawDeployment mode, install and configure the OpenShift Custom Metrics Autoscaler (CMA), which is based on Kubernetes Event-driven Autoscaling (KEDA). You can then use various model runtime metrics available in OpenShift Monitoring to trigger autoscaling of your inference service, such as KVCache utilization, Time to First Token (TTFT), and Concurrency.
Prerequisites
- You have cluster administrator privileges for your OpenShift cluster.
- You have installed the CMA operator on your cluster. For more information, see Installing the custom metrics autoscaler.

  Note:
  - You must configure the `KedaController` resource after installing the CMA operator.
  - The `odh-controller` automatically creates the `TriggerAuthentication`, `ServiceAccount`, `Role`, `RoleBinding`, and `Secret` resources to allow CMA access to OpenShift Monitoring metrics.
- You have enabled User Workload Monitoring (UWM) for your cluster. For more information, see Configuring user workload monitoring.
- You have deployed a model on the single-model serving platform in KServe RawDeployment mode.
Procedure
- Log in to the OpenShift console as a cluster administrator.
- In the Administrator perspective, click Home > Search.
- Select the project where you have deployed your model.
- From the Resources dropdown menu, select InferenceService.
- Click the `InferenceService` for your deployed model and then click YAML.
- Under `spec.predictor`, define a metrics-based autoscaling policy similar to the following example. The example configuration sets up the inference service to autoscale between 1 and 5 replicas based on the number of requests waiting to be processed, as indicated by the `vllm:num_requests_waiting` metric.
- Click Save.
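The policy can be sketched as follows. The exact schema can vary by OpenShift AI version, and the Thanos querier address and threshold value are assumptions to verify against your cluster:

```yaml
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    autoScaling:
      metrics:
        - type: External
          external:
            metric:
              backend: "prometheus"
              # Assumed address of the OpenShift Monitoring query endpoint.
              serverAddress: "https://thanos-querier.openshift-monitoring.svc.cluster.local:9092"
              query: vllm:num_requests_waiting
            target:
              type: Value
              value: "2"   # scale up when more than 2 requests are waiting
```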
Verification
- Confirm that the KEDA `ScaledObject` resource is created:

  ```
  $ oc get scaledobject -n <namespace>
  ```
2.5.5. Guidelines for metrics-based autoscaling
You can use metrics-based autoscaling to scale your AI workloads based on latency or throughput-focused Service Level Objectives (SLOs) as opposed to traditional request concurrency. Metrics-based autoscaling is based on Kubernetes Event-driven Autoscaling (KEDA).
Traditional scaling methods, which depend on factors such as request concurrency, request rate, or CPU utilization, are not effective for scaling LLM inference servers that operate on GPUs. In contrast, vLLM capacity is determined by the size of the GPU and the total number of tokens processed simultaneously. You can use custom metrics to help with autoscaling decisions to meet your SLOs.
The following guidelines can help you autoscale AI inference workloads, including selecting metrics, defining sliding windows, configuring HPA scale-down settings, and taking model size into account for optimal scaling performance.
2.5.5.1. Choosing metrics for latency and throughput-optimized scaling
For latency-sensitive applications, choose scaling metrics depending on the characteristics of the requests:
- When sequence lengths vary, use service level objectives (SLOs) for Time to First Token (TTFT) and Inter-Token Latency (ITL). These metrics provide more scaling signals because they are less affected by changes in sequence length.
- Use end-to-end request latency to trigger autoscaling when requests have similar sequence lengths.
End-to-end (e2e) request latency depends on sequence length, which poses challenges for use cases with high variance in input and output token counts. A 10-token completion and a 2000-token completion have vastly different latencies even under identical system conditions. To maximize throughput without latency constraints, use the `vllm:num_requests_waiting` metric with a threshold of 0.1 (a KEDA `ScaledObject` does not support a threshold of 0) to scale your workloads. This metric scales up the system as soon as a request is queued, which maximizes utilization and prevents a backlog. This strategy works best when input and output sequence lengths are consistent.
To build effective metrics-based autoscaling, follow these best practices:
Select the right metrics:
- Analyze your load patterns to determine sequence length variance.
- Choose TTFT/ITL for high-variance workloads, and E2E latency for uniform workloads.
- Implement multiple metrics with different priorities for robust scaling decisions.
Progressively tune configurations:
- Start with conservative thresholds and longer windows.
- Monitor scaling behavior and SLO compliance over time.
- Optimize the configuration based on observed patterns and business needs.
Validate behavior through testing:
- Run load tests with realistic sequence length distributions.
- Validate scaling under various traffic patterns.
- Test edge cases, such as traffic spikes and gradual load increases.
2.5.5.2. Choosing the right sliding window
The sliding window length is the time period over which metrics are aggregated or evaluated to make scaling decisions. The sliding window length affects scaling responsiveness and stability.
The ideal window length depends on the metric you use:
- For Time to First Token (TTFT) and Inter-Token Latency (ITL) metrics, you can use shorter windows (1-2 minutes) because they are less noisy.
- For end-to-end latency metrics, you need longer windows (4-5 minutes) to account for variations in sequence length.
| Window length | Characteristics | Best for |
|---|---|---|
| Short (Less than 30 seconds) | Does not effectively trigger autoscaling if the metric scraping interval is too long. | Not recommended. |
| Medium (60 seconds) | Responds quickly to load changes, but may lead to higher costs. Can cause rapid scaling up and down, also known as thrashing. | Workloads with sharp, unpredictable spikes. |
| Long (Over 4 minutes) | Balances responsiveness and stability while reducing unnecessary scaling. Might miss brief spikes and adapt slowly to load changes. | Production workloads with moderate variability. |
2.5.5.3. Optimizing HPA scale-down configuration
Effective scale-down configuration is crucial for cost optimization and resource efficiency. It requires balancing the need to quickly terminate idle pods to reduce cluster load, with the consideration of maintaining them to avoid cold startup times. The Horizontal Pod Autoscaler (HPA) configuration for scale-down plays a critical role in removing idle pods promptly and preventing unnecessary resource usage.
You can control the HPA scale-down behavior by managing the KEDA `ScaledObject` custom resource (CR), which enables event-driven autoscaling for a specific workload.
To set the time that the HPA waits before scaling down, adjust the stabilizationWindowSeconds field as shown in the following example:
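A sketch of the relevant fields on a KEDA `ScaledObject` follows; the resource names are placeholders, and the 300-second window is an illustrative value:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: <model_name>-scaledobject
  namespace: <project_namespace>
spec:
  scaleTargetRef:
    name: <model_name>-predictor
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          # Wait 5 minutes of sustained low load before removing replicas,
          # which prevents thrashing and repeated cold starts.
          stabilizationWindowSeconds: 300
```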
2.5.5.4. Considering model size for optimal scaling
Model size affects autoscaling behavior and resource use. The following table describes the typical characteristics for different model sizes and describes a scaling strategy to select when implementing metrics-based autoscaling for AI inference workloads.
| Model size | Memory footprint | Scaling strategy | Cold start time |
|---|---|---|---|
| Small (Less than 3B) | Less than 6 GiB | Use aggressive scaling with lower resource buffers. | Up to 10 minutes to download and 30 seconds to load. |
| Medium (3B-10B) | 6-20 GiB | Use a more conservative scaling strategy. | Up to 30 minutes to download and 1 minute to load. |
| Large (Greater than 10B) | Greater than 20 GiB | May require model sharding or quantization. | Up to several hours to download and minutes to load. |
For models with fewer than 3 billion parameters, you can reduce cold start latency with the following strategies:
- Optimize container images by embedding models directly into the image instead of downloading them at runtime. You can also use multi-stage builds to reduce the final image size and use image layer caching for faster container pulls.
- Cache models on a Persistent Volume Claim (PVC) to share storage across replicas. Configure your inference service to use the PVC to access the cached model.
2.6. Using Grafana to monitor model performance
You can deploy a Grafana metrics dashboard to monitor the performance and resource usage of your models. Metrics dashboards can help you visualize key metrics for your model-serving runtimes and hardware accelerators.
2.6.1. Deploying a Grafana metrics dashboard
You can deploy a Grafana metrics dashboard for User Workload Monitoring (UWM) to monitor performance and resource usage metrics for models deployed on the single-model serving platform.
You can create a Kustomize overlay, similar to this example. Use the overlay to deploy preconfigured metrics dashboards for models deployed with OpenVINO Model Server (OVMS) and vLLM.
Prerequisites
- You have cluster admin privileges for your OpenShift cluster.
- You have installed the OpenShift CLI (`oc`) as described in the appropriate documentation for your cluster:
  - Installing the OpenShift CLI for OpenShift Dedicated
- Installing the OpenShift CLI for Red Hat OpenShift Service on AWS (classic architecture)
You have created an overlay to deploy a Grafana instance, similar to this example.
For vLLM deployments, see examples in Monitoring Dashboards.
Note: To view GPU metrics, you must enable the NVIDIA GPU monitoring dashboard as described in Enabling the GPU monitoring dashboard. The GPU monitoring dashboard provides a comprehensive view of GPU utilization, memory usage, and other metrics for your GPU nodes.
Procedure
- In a terminal window, log in to the OpenShift CLI (oc) as a cluster administrator.
- If you have not already created the overlay to install the Grafana operator and metrics dashboards, refer to the RHOAI UWM repository to create it.
- Install the Grafana instance and metrics dashboards on your OpenShift cluster with the overlay that you created. Replace <overlay-name> with the name of your overlay.

  oc apply -k overlays/<overlay-name>

- Retrieve the URL of the Grafana instance. Replace <namespace> with the namespace that contains the Grafana instance.

  oc get route -n <namespace> grafana-route -o jsonpath='{.spec.host}'

- Use the URL to access the Grafana instance:

  grafana-<namespace>.apps.example-openshift.com
Verification
- You can access the preconfigured dashboards available for KServe, vLLM and OVMS on the Grafana instance.
2.6.2. Deploying a vLLM/GPU metrics dashboard on a Grafana instance
Deploy Grafana dashboards to monitor accelerator and vLLM performance metrics.
Prerequisites
- You have deployed a Grafana metrics dashboard, as described in Deploying a Grafana metrics dashboard.
- You can access a Grafana instance.
- You have installed envsubst, a command-line tool used to substitute environment variables in configuration files. For more information, see the GNU gettext documentation.
Procedure
- Define a GrafanaDashboard object in a YAML file, similar to the following examples:
  - To monitor NVIDIA accelerator metrics, see nvidia-vllm-dashboard.yaml.
  - To monitor AMD accelerator metrics, see amd-vllm-dashboard.yaml.
  - To monitor Intel accelerator metrics, see gaudi-vllm-dashboard.yaml.
  - To monitor vLLM metrics, see grafana-vllm-dashboard.yaml.
- Create an inputs.env file similar to the following example. Replace the NAMESPACE and MODEL_NAME parameters with your own values:

  NAMESPACE=<namespace>
  MODEL_NAME=<model-name>
NAMESPACEandMODEL_NAMEparameters in your YAML file with the values from theinputs.envfile by performing the following actions:Export the parameters described in the
inputs.envas environment variables:export $(cat inputs.env | xargs)
export $(cat inputs.env | xargs)Copy to Clipboard Copied! Toggle word wrap Toggle overflow Update the following YAML file, replacing the
${NAMESPACE}and${MODEL_NAME}variables with the values of the exported environment variables, anddashboard_template.yamlwith the name of theGrafanaDashboardobject YAML file that you created earlier:envsubst '${NAMESPACE} ${MODEL_NAME}' < dashboard_template.yaml > dashboard_template-replaced.yamlenvsubst '${NAMESPACE} ${MODEL_NAME}' < dashboard_template.yaml > dashboard_template-replaced.yamlCopy to Clipboard Copied! Toggle word wrap Toggle overflow
- Confirm that your YAML file contains updated values.
- Deploy the dashboard object:

  oc create -f dashboard_template-replaced.yaml
Verification
- You can see the accelerator and vLLM metrics dashboards on your Grafana instance.
2.6.3. Grafana metrics
You can use Grafana dashboards to monitor accelerator and vLLM performance metrics. The datasource, instance, and gpu variables are defined within the dashboard.
2.6.3.1. Accelerator metrics
Track metrics on your accelerators to ensure the health of the hardware.
- NVIDIA GPU utilization
Tracks the percentage of time the GPU is actively processing tasks, indicating GPU workload levels.
Query
DCGM_FI_DEV_GPU_UTIL{instance=~"$instance", gpu=~"$gpu"}
- NVIDIA GPU memory utilization
Compares memory usage against free memory, which is critical for identifying memory bottlenecks in GPU-heavy workloads.
Query
DCGM_FI_DEV_FB_USED{instance=~"$instance", gpu=~"$gpu"}
Sum
sum(DCGM_FI_DEV_FB_USED{instance=~"$instance", gpu=~"$gpu"})
- NVIDIA GPU temperature
Ensures the GPU operates within safe thermal limits to prevent hardware degradation.
Query
DCGM_FI_DEV_GPU_TEMP{instance=~"$instance", gpu=~"$gpu"}
Avg
avg(DCGM_FI_DEV_GPU_TEMP{instance=~"$instance", gpu=~"$gpu"})
- NVIDIA GPU throttling
GPU throttling occurs when the GPU automatically reduces the clock to avoid damage from overheating.
You can access the following metrics to identify GPU throttling:
- GPU temperature: Monitor the GPU temperature. Throttling often occurs when the GPU reaches a certain temperature, for example, 85-90°C.
- SM clock speed: Monitor the core clock speed. A significant drop in the clock speed while the GPU is under load indicates throttling.
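The two throttling signals above can be sketched as dashboard queries. This assumes the DCGM exporter exposes the standard field names, where DCGM_FI_DEV_SM_CLOCK reports the SM clock speed in MHz:

```promql
# GPU core temperature (°C); sustained values near 85-90°C suggest thermal throttling
DCGM_FI_DEV_GPU_TEMP{instance=~"$instance", gpu=~"$gpu"}

# SM clock speed (MHz); a drop under load while temperature is high indicates throttling
DCGM_FI_DEV_SM_CLOCK{instance=~"$instance", gpu=~"$gpu"}
```

Plotting both series on the same panel makes the correlation visible: throttling shows up as a clock-speed drop that coincides with a temperature peak.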
2.6.3.2. CPU metrics
You can track metrics on your CPU to ensure the health of the hardware.
- CPU utilization
Tracks CPU usage to identify workloads that are CPU-bound.
Query
sum(rate(container_cpu_usage_seconds_total{namespace="$namespace", pod=~"$model_name.*"}[5m])) by (namespace)
- CPU-GPU bottlenecks
A combination of CPU throttling and GPU usage metrics to identify resource allocation inefficiencies. The following table outlines the combination of CPU throttling and GPU utilizations, and what these metrics mean for your environment:
| CPU throttling | GPU utilization | Meaning |
|---|---|---|
| Low | High | System well-balanced. GPU is fully used without CPU constraints. |
| High | Low | CPU resources are constrained. The CPU is unable to keep up with the GPU’s processing demands, and the GPU may be underused. |
| High | High | Workload is increasing for both CPU and GPU, and you might need to scale up resources. |
Queries
sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace", pod=~"$model_name.*"}[5m])) by (namespace)
avg_over_time(DCGM_FI_DEV_GPU_UTIL{instance=~"$instance", gpu=~"$gpu"}[5m])
2.6.3.3. vLLM metrics
You can track metrics related to your vLLM model.
- GPU and CPU cache utilization
Tracks the percentage of GPU and CPU cache memory used by the vLLM model, providing insights into memory efficiency.
Query
sum_over_time(vllm:gpu_cache_usage_perc{namespace="${namespace}",pod=~"$model_name.*"}[24h])
- Running requests
The number of requests actively being processed. Helps monitor workload concurrency.
vllm:num_requests_running{namespace="$namespace", pod=~"$model_name.*"}
- Waiting requests
Tracks requests in the queue, indicating system saturation.
vllm:num_requests_waiting{namespace="$namespace", pod=~"$model_name.*"}
- Prefix cache hit rates
High hit rates imply efficient reuse of cached computations, optimizing resource usage.
Queries
vllm:gpu_cache_usage_perc{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}
vllm:cpu_cache_usage_perc{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}
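Note that the queries above report cache utilization rather than a hit rate. Recent vLLM releases also expose dedicated prefix-cache counters; assuming those metric names are available in your vLLM version, a hit-rate ratio can be sketched as:

```promql
# fraction of prefix-cache lookups that hit, over the last 5 minutes
rate(vllm:gpu_prefix_cache_hits_total{namespace="$namespace", pod=~"$model_name.*"}[5m])
/
rate(vllm:gpu_prefix_cache_queries_total{namespace="$namespace", pod=~"$model_name.*"}[5m])
```

A ratio close to 1 indicates efficient reuse of cached prefill computations across requests.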
- Request total count
Query
vllm:request_success_total{finished_reason="length",namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}
The request ended because it reached the maximum token limit set for the model inference.
Query
vllm:request_success_total{finished_reason="stop",namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}
The request completed naturally based on the model’s output or a stop condition, for example, the end of a sentence or token completion.
- End-to-end latency
- Measures the overall time to process a request for an optimal user experience.
Histogram queries
- Time to first token (TTFT) latency
The time taken to generate the first token in a response.
Histogram queries
- Time per output token (TPOT) latency
The average time taken to generate each output token.
Histogram queries
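The histogram queries for the three latency panels are not reproduced here. As a hedged sketch, assuming the standard vLLM histogram metric names (vllm:e2e_request_latency_seconds_bucket, vllm:time_to_first_token_seconds_bucket, and vllm:time_per_output_token_seconds_bucket), a percentile panel typically follows this pattern:

```promql
# p95 time to first token over the last 5 minutes; swap in the other
# _bucket metrics for end-to-end or per-output-token latency panels
histogram_quantile(0.95,
  sum(rate(vllm:time_to_first_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*"}[5m])) by (le)
)
```

Adjust the quantile (for example, 0.5 or 0.99) to track median or tail latency for the same metric.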
- Prompt token throughput and generation throughput
Tracks the rate at which prompt tokens are processed and output tokens are generated, for LLM optimization.
Queries
rate(vllm:prompt_tokens_total{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
rate(vllm:generation_tokens_total{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
- Total tokens generated
- Measures the efficiency of generating response tokens, critical for real-time applications.
Query
sum(vllm:generation_tokens_total{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"})