Chapter 4. Making inference requests to deployed models


When you deploy a model, it is available as a service that you can access with API requests. This allows you to get predictions from your model based on the data you provide in the request.

If you secured your model inference endpoint by enabling token authentication, you must know how to access your authentication token so that you can specify it in your inference requests.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • You have deployed a model by using the model serving platform.

Procedure

  1. From the OpenShift AI dashboard, click Projects.

    The Projects page opens.

  2. Click the name of the project that contains your deployed model.

    A project details page opens.

  3. Click the Deployments tab.
  4. In the Deployments list, expand the section for your model.

    Your authentication token is shown in the Token authentication section, in the Token secret field.

  5. Optional: To copy the authentication token for use in an inference request, click the Copy button ( osd copy ) next to the token value.

To make inference requests to your deployed model, you must know how to access the inference endpoint that is available.

For a list of paths to use with the supported runtimes and example commands, see Inference endpoints.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • You have deployed a model by using the model serving platform.
  • If you enabled token authentication for your deployed model, you have the associated token value.

Procedure

  1. From the OpenShift AI dashboard, click AI hub Deployments.

    The inference endpoint for the model is shown in the Inference endpoints field.

  2. Depending on what action you want to perform with the model (and if the model supports that action), copy the inference endpoint and then add a path to the end of the URL.
  3. Use the endpoint to make API requests to your deployed model.

When you deploy a model by using the model serving platform, the model is available as a service that you can access using API requests. This enables you to return predictions based on data inputs. To use API requests to interact with your deployed model, you must know the inference endpoint for the model.

In addition, if you secured your inference endpoint by enabling token authentication, you must know how to access your authentication token so that you can specify this in your inference requests.

4.4. Inference endpoints

These examples show how to use inference endpoints to query the model.

Note

If you enabled token authentication when deploying the model, add the Authorization header and specify a token value.

4.4.1. Caikit TGIS ServingRuntime for KServe

  • :443/api/v1/task/text-generation
  • :443/api/v1/task/server-streaming-text-generation

Example command

curl --json '{"model_id": "<model_name__>", "inputs": "<text>"}' https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation -H 'Authorization: Bearer <token>'
Copy to Clipboard Toggle word wrap

4.4.2. OpenVINO Model Server

  • /v2/models/<model-name>/infer

Example command

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'
Copy to Clipboard Toggle word wrap

4.4.3. vLLM NVIDIA GPU ServingRuntime for KServe

  • :443/version
  • :443/docs
  • :443/v1/models
  • :443/v1/chat/completions
  • :443/v1/completions
  • :443/v1/embeddings
  • :443/tokenize
  • :443/detokenize

    Note
    • The vLLM runtime is compatible with the OpenAI REST API.
    • To use the embeddings inference endpoint in vLLM, you must use an embeddings model that the vLLM supports. You cannot use the embeddings endpoint with generative models. For more information, see Supported embeddings models in vLLM.
    • As of vLLM v0.5.5, you must provide a chat template while querying a model using the /v1/chat/completions endpoint. If your model does not include a predefined chat template, you can use the chat-template command-line parameter to specify a chat template in your custom vLLM runtime, as shown in the example. Replace <CHAT_TEMPLATE> with the path to your template.

      containers:
        - args:
            - --chat-template=<CHAT_TEMPLATE>
      Copy to Clipboard Toggle word wrap

      You can use the chat templates that are available as .jinja files here or with the vLLM image under /app/data/template. For more information, see Chat templates.

    As indicated by the paths shown, the model serving platform uses the HTTPS port of your OpenShift router (usually port 443) to serve external API requests.

Example command

curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -d '{ "messages": [{ "role": "<role>", "content": "<content>" }] -H 'Authorization: Bearer <token>'
Copy to Clipboard Toggle word wrap

See vLLM NVIDIA GPU ServingRuntime for KServe.

4.4.5. vLLM AMD GPU ServingRuntime for KServe

See vLLM NVIDIA GPU ServingRuntime for KServe.

Important

Support for IBM Spyre AI Accelerators on x86 is currently available in Red Hat OpenShift AI 3.2 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

You can serve models with IBM Spyre AI accelerators on x86 by using the vLLM Spyre AI Accelerator ServingRuntime for KServe runtime. To use the runtime, you must install the Spyre Operator and configure a hardware profile. For more information, see Spyre operator image and Working with hardware profiles.

4.4.7. vLLM Spyre s390x ServingRuntime for KServe

You can serve models with IBM Spyre AI accelerators on IBM Z (s390x architecture) by using the vLLM Spyre s390x ServingRuntime for KServe runtime. To use the runtime, you must install the Spyre Operator and configure a hardware profile. For more information, see Spyre operator image and Working with hardware profiles.

4.4.8. NVIDIA Triton Inference Server

REST endpoints

  • v2/models/[/versions/<model_version>]/infer
  • v2/models/<model_name>[/versions/<model_version>]
  • v2/health/ready
  • v2/health/live
  • v2/models/<model_name>[/versions/]/ready
  • v2

Example command

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'
Copy to Clipboard Toggle word wrap

gRPC endpoints

  • :443 inference.GRPCInferenceService/ModelInfer
  • :443 inference.GRPCInferenceService/ModelReady
  • :443 inference.GRPCInferenceService/ModelMetadata
  • :443 inference.GRPCInferenceService/ServerReady
  • :443 inference.GRPCInferenceService/ServerLive
  • :443 inference.GRPCInferenceService/ServerMetadata

Example command

grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata
Copy to Clipboard Toggle word wrap

4.4.9. Seldon MLServer

REST endpoints

  • v2/models/[/versions/<model_version>]/infer
  • v2/models/<model_name>[/versions/<model_version>]
  • v2/health/ready
  • v2/health/live
  • v2/models/<model_name>[/versions/]/ready
  • v2

Example command

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'
Copy to Clipboard Toggle word wrap

gRPC endpoints

  • :443 inference.GRPCInferenceService/ModelInfer
  • :443 inference.GRPCInferenceService/ModelReady
  • :443 inference.GRPCInferenceService/ModelMetadata
  • :443 inference.GRPCInferenceService/ServerReady
  • :443 inference.GRPCInferenceService/ServerLive
  • :443 inference.GRPCInferenceService/ServerMetadata

Example command

grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata
Copy to Clipboard Toggle word wrap

Red Hat logoGithubredditYoutubeTwitter

Learn

Try, buy, & sell

Communities

About Red Hat Documentation

We help Red Hat users innovate and achieve their goals with our products and services with content they can trust. Explore our recent updates.

Making open source more inclusive

Red Hat is committed to replacing problematic language in our code, documentation, and web properties. For more details, see the Red Hat Blog.

About Red Hat

We deliver hardened solutions that make it easier for enterprises to work across platforms and environments, from the core datacenter to the network edge.

Theme

© 2026 Red Hat
Back to top