Chapter 5. Making inference requests to deployed models


When you deploy a model, it is available as a service that you can access with API requests. This allows you to get predictions from your model based on the data you provide in the request.

If you secured your model inference endpoint by enabling token authentication, you must know how to access your authentication token so that you can specify it in your inference requests.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • You have deployed a model by using the single-model serving platform.

Procedure

  1. From the OpenShift AI dashboard, click Data science projects.

    The Data science projects page opens.

  2. Click the name of the project that contains your deployed model.

    A project details page opens.

  3. Click the Models tab.
  4. In the Models and model servers list, expand the section for your model.

    Your authentication token is shown in the Token authentication section, in the Token secret field.

  5. Optional: To copy the authentication token for use in an inference request, click the Copy button ( osd copy ) next to the token value.

To make inference requests to your deployed model, you must know how to access the inference endpoint that is available.

For a list of paths to use with the supported runtimes and example commands, see Inference endpoints.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • You have deployed a model by using the single-model serving platform.
  • If you enabled token authentication for your deployed model, you have the associated token value.

Procedure

  1. From the OpenShift AI dashboard, click Models Model deployments.

    The inference endpoint for the model is shown in the Inference endpoint field.

  2. Depending on what action you want to perform with the model (and if the model supports that action), copy the inference endpoint and then add a path to the end of the URL.
  3. Use the endpoint to make API requests to your deployed model.

When you deploy a model by using the single-model serving platform, the model is available as a service that you can access using API requests. This enables you to return predictions based on data inputs. To use API requests to interact with your deployed model, you must know the inference endpoint for the model.

In addition, if you secured your inference endpoint by enabling token authentication, you must know how to access your authentication token so that you can specify this in your inference requests.

5.4. Inference endpoints

These examples show how to use inference endpoints to query the model.

Note

If you enabled token authentication when deploying the model, add the Authorization header and specify a token value.

5.4.1. Caikit TGIS ServingRuntime for KServe

  • :443/api/v1/task/text-generation
  • :443/api/v1/task/server-streaming-text-generation

Example command

curl --json '{"model_id": "<model_name__>", "inputs": "<text>"}' https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation -H 'Authorization: Bearer <token>'

5.4.2. Caikit Standalone ServingRuntime for KServe

If you are serving multiple models, you can query /info/models or :443 caikit.runtime.info.InfoService/GetModelsInfo to view a list of served models.

REST endpoints

  • /api/v1/task/embedding
  • /api/v1/task/embedding-tasks
  • /api/v1/task/sentence-similarity
  • /api/v1/task/sentence-similarity-tasks
  • /api/v1/task/rerank
  • /api/v1/task/rerank-tasks
  • /info/models
  • /info/version
  • /info/runtime

gRPC endpoints

  • :443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict
  • :443 caikit.runtime.Nlp.NlpService/EmbeddingTasksPredict
  • :443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTaskPredict
  • :443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTasksPredict
  • :443 caikit.runtime.Nlp.NlpService/RerankTaskPredict
  • :443 caikit.runtime.Nlp.NlpService/RerankTasksPredict
  • :443 caikit.runtime.info.InfoService/GetModelsInfo
  • :443 caikit.runtime.info.InfoService/GetRuntimeInfo
Note

By default, the Caikit Standalone Runtime exposes REST endpoints. To use gRPC protocol, manually deploy a custom Caikit Standalone ServingRuntime. For more information, see Adding a custom model-serving runtime for the single-model serving platform.

An example manifest is available in the caikit-tgis-serving GitHub repository.

Example command

REST

curl -H 'Content-Type: application/json' -d '{"inputs": "<text>", "model_id": "<model_id>"}' <inference_endpoint_url>/api/v1/task/embedding -H 'Authorization: Bearer <token>'

gRPC

grpcurl -d '{"text": "<text>"}' -H \"mm-model-id: <model_id>\" <inference_endpoint_url>:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict -H 'Authorization: Bearer <token>'

5.4.3. TGIS Standalone ServingRuntime for KServe

Important

The Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe is deprecated. For more information, see OpenShift AI release notes.

  • :443 fmaas.GenerationService/Generate
  • :443 fmaas.GenerationService/GenerateStream

    Note

    To query the endpoint for the TGIS standalone runtime, you must also download the files in the proto directory of the OpenShift AI text-generation-inference repository.

Example command

grpcurl -proto text-generation-inference/proto/generation.proto -d '{"requests": [{"text":"<text>"}]}' -H 'Authorization: Bearer <token>' -insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate

5.4.4. OpenVINO Model Server

  • /v2/models/<model-name>/infer

Example command

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

5.4.5. vLLM NVIDIA GPU ServingRuntime for KServe

  • :443/version
  • :443/docs
  • :443/v1/models
  • :443/v1/chat/completions
  • :443/v1/completions
  • :443/v1/embeddings
  • :443/tokenize
  • :443/detokenize

    Note
    • The vLLM runtime is compatible with the OpenAI REST API. For a list of models that the vLLM runtime supports, see Supported models.
    • To use the embeddings inference endpoint in vLLM, you must use an embeddings model that the vLLM supports. You cannot use the embeddings endpoint with generative models. For more information, see Supported embeddings models in vLLM.
    • As of vLLM v0.5.5, you must provide a chat template while querying a model using the /v1/chat/completions endpoint. If your model does not include a predefined chat template, you can use the chat-template command-line parameter to specify a chat template in your custom vLLM runtime, as shown in the example. Replace <CHAT_TEMPLATE> with the path to your template.

      containers:
        - args:
            - --chat-template=<CHAT_TEMPLATE>

      You can use the chat templates that are available as .jinja files here or with the vLLM image under /app/data/template. For more information, see Chat templates.

    As indicated by the paths shown, the single-model serving platform uses the HTTPS port of your OpenShift router (usually port 443) to serve external API requests.

Example command

curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -d '{ "messages": [{ "role": "<role>", "content": "<content>" }] -H 'Authorization: Bearer <token>'

See vLLM NVIDIA GPU ServingRuntime for KServe.

5.4.7. vLLM AMD GPU ServingRuntime for KServe

See vLLM NVIDIA GPU ServingRuntime for KServe.

Important

Support for IBM Spyre AI Accelerators on x86 is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

See vLLM NVIDIA GPU ServingRuntime for KServe.

5.4.9. NVIDIA Triton Inference Server

REST endpoints

  • v2/models/[/versions/<model_version>]/infer
  • v2/models/<model_name>[/versions/<model_version>]
  • v2/health/ready
  • v2/health/live
  • v2/models/<model_name>[/versions/]/ready
  • v2
Note

ModelMesh does not support the following REST endpoints:

  • v2/health/live
  • v2/health/ready
  • v2/models/<model_name>[/versions/]/ready

Example command

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

gRPC endpoints

  • :443 inference.GRPCInferenceService/ModelInfer
  • :443 inference.GRPCInferenceService/ModelReady
  • :443 inference.GRPCInferenceService/ModelMetadata
  • :443 inference.GRPCInferenceService/ServerReady
  • :443 inference.GRPCInferenceService/ServerLive
  • :443 inference.GRPCInferenceService/ServerMetadata

Example command

grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata

5.4.10. Seldon MLServer

REST endpoints

  • v2/models/[/versions/<model_version>]/infer
  • v2/models/<model_name>[/versions/<model_version>]
  • v2/health/ready
  • v2/health/live
  • v2/models/<model_name>[/versions/]/ready
  • v2

Example command

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

gRPC endpoints

  • :443 inference.GRPCInferenceService/ModelInfer
  • :443 inference.GRPCInferenceService/ModelReady
  • :443 inference.GRPCInferenceService/ModelMetadata
  • :443 inference.GRPCInferenceService/ServerReady
  • :443 inference.GRPCInferenceService/ServerLive
  • :443 inference.GRPCInferenceService/ServerMetadata

Example command

grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata

Red Hat logoGithubredditYoutubeTwitter

Learn

Try, buy, & sell

Communities

About Red Hat

We deliver hardened solutions that make it easier for enterprises to work across platforms and environments, from the core datacenter to the network edge.

Making open source more inclusive

Red Hat is committed to replacing problematic language in our code, documentation, and web properties. For more details, see the Red Hat Blog.

About Red Hat Documentation

Legal Notice

Theme

© 2026 Red Hat
Back to top