Chapter 3. Deploying models on the NVIDIA NIM model serving platform


You can deploy models using NVIDIA NIM inference services on the NVIDIA NIM model serving platform.

NVIDIA NIM, part of NVIDIA AI Enterprise, is a set of microservices designed for secure, reliable deployment of high-performance AI model inferencing across clouds, data centers, and workstations.

When you have enabled the NVIDIA NIM model serving platform, you can start to deploy NVIDIA-optimized models on the platform.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • You have enabled the NVIDIA NIM model serving platform.
  • You have created a data science project.
  • You have enabled support for graphic processing units (GPUs) in OpenShift AI. This includes installing the Node Feature Discovery Operator and NVIDIA GPU Operator. For more information, see Installing the Node Feature Discovery Operator and Enabling NVIDIA GPUs.
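As a quick sanity check before deploying, you can confirm from the CLI that the GPU stack is in place. The namespace and node label below are the defaults used by the NVIDIA GPU Operator and Node Feature Discovery; adjust them if your cluster uses different ones:

```shell
# Confirm that Node Feature Discovery has labeled the GPU nodes
# (10de is the NVIDIA PCI vendor ID that NFD reports).
oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true

# Confirm that the NVIDIA GPU Operator pods are running
# (the operator installs into the nvidia-gpu-operator namespace by default).
oc get pods -n nvidia-gpu-operator
```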

Procedure

  1. In the left menu, click Data science projects.

    The Data science projects page opens.

  2. Click the name of the project that you want to deploy a model in.

    A project details page opens.

  3. Click the Models tab.
  4. In the Models section, perform one of the following actions:

    • On the NVIDIA NIM model serving platform tile, click Select NVIDIA NIM, and then click Deploy model.
    • If you have previously selected the NVIDIA NIM model serving type, the Models page displays NVIDIA model serving enabled on the upper-right corner, along with the Deploy model button. To proceed, click Deploy model.

    The Deploy model dialog opens.

  5. Configure properties for deploying your model as follows:

    1. In the Model deployment name field, enter a unique name for the deployment.
    2. From the NVIDIA NIM list, select the NVIDIA NIM model that you want to deploy. For more information, see Supported Models.
    3. In the NVIDIA NIM storage size field, specify the size of the cluster storage instance that will be created to store the NVIDIA NIM model.

      Note

      When resizing a PersistentVolumeClaim (PVC) backed by Amazon EBS in OpenShift AI, you might encounter the error VolumeModificationRateExceeded: You've reached the maximum modification rate per volume limit. This error is caused by an Amazon EBS service limit that applies to all workloads using EBS-backed PVCs: if you resize a PVC before the cooldown period expires, the Amazon EBS CSI driver (ebs.csi.aws.com) fails with this error. To avoid it, wait at least six hours between modifications of each EBS volume.

    4. In the Number of model server replicas to deploy field, specify a value.
    5. From the Model server size list, select a value.
  6. From the Hardware profile list, select a hardware profile.

    Important

    By default, hardware profiles are hidden in the dashboard navigation menu and user interface, while accelerator profiles remain visible. In addition, user interface components associated with the deprecated accelerator profiles functionality are still displayed. If you enable hardware profiles, the Hardware profiles list is displayed instead of the Accelerator profiles list. To show the Settings → Hardware profiles option in the dashboard navigation menu, and the user interface components associated with hardware profiles, set the disableHardwareProfiles value to false in the OdhDashboardConfig custom resource (CR) in OpenShift. For more information about setting dashboard configuration options, see Customizing the dashboard.

  7. Optional: Click Customize resource requests and limit and update the following values:

    1. In the CPUs requests field, specify the number of CPUs to use with your model server. Use the list beside this field to specify the value in cores or millicores.
    2. In the CPU limits field, specify the maximum number of CPUs to use with your model server. Use the list beside this field to specify the value in cores or millicores.
    3. In the Memory requests field, specify the requested memory for the model server in gibibytes (Gi).
    4. In the Memory limits field, specify the maximum memory limit for the model server in gibibytes (Gi).
  8. Optional: In the Model route section, select the Make deployed models available through an external route checkbox to make your deployed models available to external clients.
  9. To require token authentication for inference requests to the deployed model, perform the following actions:

    1. Select Require token authentication.
    2. In the Service account name field, enter the service account name that the token will be generated for.
    3. To add an additional service account, click Add a service account and enter another service account name.
  10. Click Deploy.
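After the deployment reports ready, you can exercise the endpoint from the command line. The route host and token value below are placeholders for illustration; LLM NIM microservices expose an OpenAI-compatible API, so listing the served models is a reasonable smoke test:

```shell
# TOKEN holds the value shown for the deployment's token authentication
# on the Models tab, and NIM_ROUTE is the deployment's external route
# (both values are placeholders).
TOKEN="<token from the dashboard>"
NIM_ROUTE="https://nim-deployment-my-project.apps.example.com"

# NIM exposes an OpenAI-compatible API; list the served models as a smoke test.
curl -sk -H "Authorization: Bearer $TOKEN" "$NIM_ROUTE/v1/models"
```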

Verification

  • Confirm that the deployed model is shown on the Models tab for the project, and on the Model deployments page of the dashboard with a checkmark in the Status column.
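The same check can be made from the CLI. On the NVIDIA NIM model serving platform, a deployment is backed by a KServe InferenceService in the project namespace; the namespace below is a placeholder:

```shell
# List the InferenceService resources for the project; READY should report
# True once the NIM container has pulled and loaded the model.
oc get inferenceservice -n my-project
```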

3.2. Viewing NVIDIA NIM metrics for a NIM model

In OpenShift AI, you can observe the following NVIDIA NIM metrics for a NIM model deployed on the NVIDIA NIM model serving platform:

  • GPU cache usage over time (ms)
  • Current running, waiting, and max requests count
  • Tokens count
  • Time to first token
  • Time per output token
  • Request outcomes

You can specify a time range and a refresh interval for these metrics to help you determine, for example, the peak usage hours and model performance at a specified time.

Prerequisites

  • You have enabled the NVIDIA NIM model serving platform.
  • You have deployed a NIM model on the NVIDIA NIM model serving platform.
  • A cluster administrator has enabled metrics collection and graph generation for your deployment.
  • The disableKServeMetrics OpenShift AI dashboard configuration option is set to its default value of false:

    disableKServeMetrics: false

    For more information about setting dashboard configuration options, see Customizing the dashboard.
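You can check, and if necessary reset, this option from the CLI. The resource name and namespace below are the defaults in a standard OpenShift AI installation; adjust them if your cluster differs:

```shell
# Inspect the current value of disableKServeMetrics.
oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications \
  -o jsonpath='{.spec.dashboardConfig.disableKServeMetrics}'

# Set it back to the default (false) if it has been changed.
oc patch odhdashboardconfig odh-dashboard-config -n redhat-ods-applications \
  --type merge -p '{"spec":{"dashboardConfig":{"disableKServeMetrics":false}}}'
```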

Procedure

  1. From the OpenShift AI dashboard navigation menu, click Data science projects.

    The Data science projects page opens.

  2. Click the name of the project that contains the NIM model that you want to monitor.
  3. In the project details page, click the Models tab.
  4. Click the NIM model that you want to observe.
  5. On the NIM Metrics tab, set the following options:

    • Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
    • Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed to show the latest data. You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.
  6. Scroll down to view data graphs for NIM metrics.
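Behind these graphs, the NIM container exports Prometheus-format metrics, which you can also scrape directly from inside the cluster if you want the raw values. The pod name and port below are illustrative; LLM NIM containers typically serve metrics on port 8000:

```shell
# Port-forward to the NIM pod (pod name and metrics port are illustrative).
oc port-forward -n my-project pod/nim-deployment-predictor-0 8000:8000 &

# Fetch the raw Prometheus metrics that back the dashboard graphs.
curl -s http://localhost:8000/metrics | head
```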

Verification

The NIM Metrics tab shows graphs of NIM metrics for the deployed NIM model.

3.3. Viewing performance metrics for a NIM model

You can observe the following performance metrics for a NIM model deployed on the NVIDIA NIM model serving platform:

  • Number of requests - The number of requests that have failed or succeeded for a specific model.
  • Average response time (ms) - The average time it takes a specific model to respond to requests.
  • CPU utilization (%) - The percentage of the CPU limit per model replica that is currently utilized by a specific model.
  • Memory utilization (%) - The percentage of the memory limit per model replica that is utilized by a specific model.

You can specify a time range and a refresh interval for these metrics to help you determine, for example, the peak usage hours and model performance at a specified time.
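As a rough illustration of how a "percentage of the limit" metric such as CPU utilization is derived, a PromQL expression of the following shape divides the replica's observed usage by its configured limit. The namespace and pod labels are placeholders, and the exact query used by the dashboard may differ:

```shell
# PromQL sketch: CPU usage as a percentage of the container's CPU limit.
# Namespace and pod labels are placeholders for the NIM deployment's replica.
QUERY='100 * sum(rate(container_cpu_usage_seconds_total{namespace="my-project",pod=~"nim-deployment-.*"}[5m])) / sum(kube_pod_container_resource_limits{namespace="my-project",pod=~"nim-deployment-.*",resource="cpu"})'

# Run the query against the in-cluster monitoring stack.
oc exec -n openshift-monitoring statefulset/prometheus-k8s -- \
  curl -sG http://localhost:9090/api/v1/query --data-urlencode "query=${QUERY}"
```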

Prerequisites

  • You have enabled the NVIDIA NIM model serving platform.
  • You have deployed a NIM model on the NVIDIA NIM model serving platform.
  • A cluster administrator has enabled metrics collection and graph generation for your deployment.
  • The disableKServeMetrics OpenShift AI dashboard configuration option is set to its default value of false:

    disableKServeMetrics: false

    For more information about setting dashboard configuration options, see Customizing the dashboard.

Procedure

  1. From the OpenShift AI dashboard navigation menu, click Data science projects.

    The Data science projects page opens.

  2. Click the name of the project that contains the NIM model that you want to monitor.
  3. In the project details page, click the Models tab.
  4. Click the NIM model that you want to observe.
  5. On the Endpoint performance tab, set the following options:

    • Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
    • Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed to show the latest data. You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.
  6. Scroll down to view data graphs for performance metrics.

Verification

The Endpoint performance tab shows graphs of performance metrics for the deployed NIM model.
