Chapter 5. Making inference requests to deployed models

When you deploy a model, it is available as a service that you can access with API requests. This allows you to get predictions from your model based on the data you provide in the request.

5.1. Accessing the authentication token for a deployed model
Copy link

If you secured your model inference endpoint by enabling token authentication, you must know how to access your authentication token so that you can specify it in your inference requests.

Prerequisites

You have logged in to Red Hat OpenShift AI.
You have deployed a model by using the single-model serving platform.

Procedure

From the OpenShift AI dashboard, click Data science projects.
The Data science projects page opens.
Click the name of the project that contains your deployed model.
A project details page opens.
Click the Models tab.
In the Models and model servers list, expand the section for your model.
Your authentication token is shown in the Token authentication section, in the Token secret field.
Optional: To copy the authentication token for use in an inference request, click the Copy button ( ) next to the token value.

5.2. Accessing the inference endpoint for a deployed model
Copy link

To make inference requests to your deployed model, you must know how to access the inference endpoint that is available.

For a list of paths to use with the supported runtimes and example commands, see Inference endpoints.

Prerequisites

You have logged in to Red Hat OpenShift AI.
You have deployed a model by using the single-model serving platform.
If you enabled token authentication for your deployed model, you have the associated token value.

Procedure

From the OpenShift AI dashboard, click Models Model deployments.
The inference endpoint for the model is shown in the Inference endpoint field.
Depending on what action you want to perform with the model (and if the model supports that action), copy the inference endpoint and then add a path to the end of the URL.
Use the endpoint to make API requests to your deployed model.

5.3. Making inference requests to models deployed on the single-model serving platform
Copy link

When you deploy a model by using the single-model serving platform, the model is available as a service that you can access using API requests. This enables you to return predictions based on data inputs. To use API requests to interact with your deployed model, you must know the inference endpoint for the model.

In addition, if you secured your inference endpoint by enabling token authentication, you must know how to access your authentication token so that you can specify this in your inference requests.

5.4. Inference endpoints
Copy link

These examples show how to use inference endpoints to query the model.

Note

If you enabled token authentication when deploying the model, add the Authorization header and specify a token value.

5.4.1. Caikit TGIS ServingRuntime for KServe
Copy link

:443/api/v1/task/text-generation
:443/api/v1/task/server-streaming-text-generation

Example command

curl --json '{"model_id": "<model_name__>", "inputs": "<text>"}' https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation -H 'Authorization: Bearer <token>'

5.4.2. Caikit Standalone ServingRuntime for KServe
Copy link

If you are serving multiple models, you can query /info/models or :443 caikit.runtime.info.InfoService/GetModelsInfo to view a list of served models.

REST endpoints

/api/v1/task/embedding
/api/v1/task/embedding-tasks
/api/v1/task/sentence-similarity
/api/v1/task/sentence-similarity-tasks
/api/v1/task/rerank
/api/v1/task/rerank-tasks
/info/models
/info/version
/info/runtime

gRPC endpoints

:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict
:443 caikit.runtime.Nlp.NlpService/EmbeddingTasksPredict
:443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTaskPredict
:443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTasksPredict
:443 caikit.runtime.Nlp.NlpService/RerankTaskPredict
:443 caikit.runtime.Nlp.NlpService/RerankTasksPredict
:443 caikit.runtime.info.InfoService/GetModelsInfo
:443 caikit.runtime.info.InfoService/GetRuntimeInfo

Note

By default, the Caikit Standalone Runtime exposes REST endpoints. To use gRPC protocol, manually deploy a custom Caikit Standalone ServingRuntime. For more information, see Adding a custom model-serving runtime for the single-model serving platform.

An example manifest is available in the caikit-tgis-serving GitHub repository.

Example command

REST

curl -H 'Content-Type: application/json' -d '{"inputs": "<text>", "model_id": "<model_id>"}' <inference_endpoint_url>/api/v1/task/embedding -H 'Authorization: Bearer <token>'

gRPC

grpcurl -d '{"text": "<text>"}' -H \"mm-model-id: <model_id>\" <inference_endpoint_url>:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict -H 'Authorization: Bearer <token>'

5.4.3. TGIS Standalone ServingRuntime for KServe
Copy link

Important

The Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe is deprecated. For more information, see OpenShift AI release notes.

:443 fmaas.GenerationService/Generate
:443 fmaas.GenerationService/GenerateStream
Note
To query the endpoint for the TGIS standalone runtime, you must also download the files in the proto directory of the OpenShift AI text-generation-inference repository.

Example command

grpcurl -proto text-generation-inference/proto/generation.proto -d '{"requests": [{"text":"<text>"}]}' -H 'Authorization: Bearer <token>' -insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate

5.4.4. OpenVINO Model Server
Copy link

/v2/models/<model-name>/infer

Example command

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

5.4.5. vLLM NVIDIA GPU ServingRuntime for KServe
Copy link

:443/version
:443/docs
:443/v1/models
:443/v1/chat/completions
:443/v1/completions
:443/v1/embeddings
:443/tokenize
:443/detokenize
Note
- The vLLM runtime is compatible with the OpenAI REST API. For a list of models that the vLLM runtime supports, see Supported models.
- To use the embeddings inference endpoint in vLLM, you must use an embeddings model that the vLLM supports. You cannot use the embeddings endpoint with generative models. For more information, see Supported embeddings models in vLLM.
- As of vLLM v0.5.5, you must provide a chat template while querying a model using the /v1/chat/completions endpoint. If your model does not include a predefined chat template, you can use the chat-template command-line parameter to specify a chat template in your custom vLLM runtime, as shown in the example. Replace <CHAT_TEMPLATE> with the path to your template.
  
  containers: - args: - --chat-template=<CHAT_TEMPLATE>
  
  You can use the chat templates that are available as .jinja files here or with the vLLM image under /app/data/template. For more information, see Chat templates.
As indicated by the paths shown, the single-model serving platform uses the HTTPS port of your OpenShift router (usually port 443) to serve external API requests.

Example command

curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -d '{ "messages": [{ "role": "<role>", "content": "<content>" }] -H 'Authorization: Bearer <token>'

5.4.6. vLLM Intel Gaudi Accelerator ServingRuntime for KServe
Copy link

See vLLM NVIDIA GPU ServingRuntime for KServe.

5.4.7. vLLM AMD GPU ServingRuntime for KServe
Copy link

See vLLM NVIDIA GPU ServingRuntime for KServe.

5.4.8. vLLM Spyre AI Accelerator ServingRuntime for KServe
Copy link

Important

Support for IBM Spyre AI Accelerators on x86 is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

See vLLM NVIDIA GPU ServingRuntime for KServe.

5.4.9. NVIDIA Triton Inference Server
Copy link

REST endpoints

v2/models/[/versions/<model_version>]/infer
v2/models/<model_name>[/versions/<model_version>]
v2/health/ready
v2/health/live
v2/models/<model_name>[/versions/]/ready
v2

Note

ModelMesh does not support the following REST endpoints:

v2/health/live
v2/health/ready
v2/models/<model_name>[/versions/]/ready

Example command

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

gRPC endpoints

:443 inference.GRPCInferenceService/ModelInfer
:443 inference.GRPCInferenceService/ModelReady
:443 inference.GRPCInferenceService/ModelMetadata
:443 inference.GRPCInferenceService/ServerReady
:443 inference.GRPCInferenceService/ServerLive
:443 inference.GRPCInferenceService/ServerMetadata

Example command

grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata

5.4.10. Seldon MLServer
Copy link

REST endpoints

v2/models/[/versions/<model_version>]/infer
v2/models/<model_name>[/versions/<model_version>]
v2/health/ready
v2/health/live
v2/models/<model_name>[/versions/]/ready
v2

Example command

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

gRPC endpoints

:443 inference.GRPCInferenceService/ModelInfer
:443 inference.GRPCInferenceService/ModelReady
:443 inference.GRPCInferenceService/ModelMetadata
:443 inference.GRPCInferenceService/ServerReady
:443 inference.GRPCInferenceService/ServerLive
:443 inference.GRPCInferenceService/ServerMetadata

Example command

grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata

Chapter 5. Making inference requests to deployed models

5.1. Accessing the authentication token for a deployed model
Copy link

5.2. Accessing the inference endpoint for a deployed model
Copy link

5.3. Making inference requests to models deployed on the single-model serving platform
Copy link

5.4. Inference endpoints
Copy link

5.4.1. Caikit TGIS ServingRuntime for KServe
Copy link

5.4.2. Caikit Standalone ServingRuntime for KServe
Copy link

5.4.3. TGIS Standalone ServingRuntime for KServe
Copy link

5.4.4. OpenVINO Model Server
Copy link

5.4.5. vLLM NVIDIA GPU ServingRuntime for KServe
Copy link

5.4.6. vLLM Intel Gaudi Accelerator ServingRuntime for KServe
Copy link

5.4.7. vLLM AMD GPU ServingRuntime for KServe
Copy link

5.4.8. vLLM Spyre AI Accelerator ServingRuntime for KServe
Copy link

5.4.9. NVIDIA Triton Inference Server
Copy link

5.4.10. Seldon MLServer
Copy link

Learn

Try, buy, & sell

Communities

About Red Hat

Making open source more inclusive

About Red Hat Documentation

Theme

Red Hat legal and privacy links

Red Hat legal and privacy links

Chapter 5. Making inference requests to deployed models

5.1. Accessing the authentication token for a deployed modelCopy linkLink copied to clipboard!

5.2. Accessing the inference endpoint for a deployed modelCopy linkLink copied to clipboard!

5.3. Making inference requests to models deployed on the single-model serving platformCopy linkLink copied to clipboard!

5.4. Inference endpointsCopy linkLink copied to clipboard!

5.4.1. Caikit TGIS ServingRuntime for KServeCopy linkLink copied to clipboard!

5.4.2. Caikit Standalone ServingRuntime for KServeCopy linkLink copied to clipboard!

5.4.3. TGIS Standalone ServingRuntime for KServeCopy linkLink copied to clipboard!

5.4.4. OpenVINO Model ServerCopy linkLink copied to clipboard!

5.4.5. vLLM NVIDIA GPU ServingRuntime for KServeCopy linkLink copied to clipboard!

5.4.6. vLLM Intel Gaudi Accelerator ServingRuntime for KServeCopy linkLink copied to clipboard!

5.4.7. vLLM AMD GPU ServingRuntime for KServeCopy linkLink copied to clipboard!

5.4.8. vLLM Spyre AI Accelerator ServingRuntime for KServeCopy linkLink copied to clipboard!

5.4.9. NVIDIA Triton Inference ServerCopy linkLink copied to clipboard!

5.4.10. Seldon MLServerCopy linkLink copied to clipboard!

Learn

Try, buy, & sell

Communities

About Red Hat

Making open source more inclusive

About Red Hat Documentation

Theme

Red Hat legal and privacy links

Red Hat legal and privacy links

5.1. Accessing the authentication token for a deployed model
Copy link

5.2. Accessing the inference endpoint for a deployed model
Copy link

5.3. Making inference requests to models deployed on the single-model serving platform
Copy link

5.4. Inference endpoints
Copy link

5.4.1. Caikit TGIS ServingRuntime for KServe
Copy link

5.4.2. Caikit Standalone ServingRuntime for KServe
Copy link

5.4.3. TGIS Standalone ServingRuntime for KServe
Copy link

5.4.4. OpenVINO Model Server
Copy link

5.4.5. vLLM NVIDIA GPU ServingRuntime for KServe
Copy link

5.4.6. vLLM Intel Gaudi Accelerator ServingRuntime for KServe
Copy link

5.4.7. vLLM AMD GPU ServingRuntime for KServe
Copy link

5.4.8. vLLM Spyre AI Accelerator ServingRuntime for KServe
Copy link

5.4.9. NVIDIA Triton Inference Server
Copy link

5.4.10. Seldon MLServer
Copy link