Chapter 5. Making inference requests to deployed models
When you deploy a model, it is available as a service that you can access with API requests. This allows you to get predictions from your model based on the data you provide in the request.
5.1. Accessing the authentication token for a deployed model Copy linkLink copied to clipboard!
If you secured your model inference endpoint by enabling token authentication, you must know how to access your authentication token so that you can specify it in your inference requests.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- You have deployed a model by using the single-model serving platform.
Procedure
From the OpenShift AI dashboard, click Data science projects.
The Data science projects page opens.
Click the name of the project that contains your deployed model.
A project details page opens.
- Click the Models tab.
In the Models and model servers list, expand the section for your model.
Your authentication token is shown in the Token authentication section, in the Token secret field.
-
Optional: To copy the authentication token for use in an inference request, click the Copy button (
) next to the token value.
5.2. Accessing the inference endpoint for a deployed model Copy linkLink copied to clipboard!
To make inference requests to your deployed model, you must know how to access the inference endpoint that is available.
For a list of paths to use with the supported runtimes and example commands, see Inference endpoints.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- You have deployed a model by using the single-model serving platform.
- If you enabled token authentication for your deployed model, you have the associated token value.
Procedure
From the OpenShift AI dashboard, click Models
Model deployments. The inference endpoint for the model is shown in the Inference endpoint field.
- Depending on what action you want to perform with the model (and if the model supports that action), copy the inference endpoint and then add a path to the end of the URL.
- Use the endpoint to make API requests to your deployed model.
5.3. Making inference requests to models deployed on the single-model serving platform Copy linkLink copied to clipboard!
When you deploy a model by using the single-model serving platform, the model is available as a service that you can access using API requests. This enables you to return predictions based on data inputs. To use API requests to interact with your deployed model, you must know the inference endpoint for the model.
In addition, if you secured your inference endpoint by enabling token authentication, you must know how to access your authentication token so that you can specify this in your inference requests.
5.4. Inference endpoints Copy linkLink copied to clipboard!
These examples show how to use inference endpoints to query the model.
If you enabled token authentication when deploying the model, add the Authorization header and specify a token value.
5.4.1. Caikit TGIS ServingRuntime for KServe Copy linkLink copied to clipboard!
-
:443/api/v1/task/text-generation -
:443/api/v1/task/server-streaming-text-generation
Example command
curl --json '{"model_id": "<model_name__>", "inputs": "<text>"}' https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation -H 'Authorization: Bearer <token>'
5.4.2. Caikit Standalone ServingRuntime for KServe Copy linkLink copied to clipboard!
If you are serving multiple models, you can query /info/models or :443 caikit.runtime.info.InfoService/GetModelsInfo to view a list of served models.
REST endpoints
-
/api/v1/task/embedding -
/api/v1/task/embedding-tasks -
/api/v1/task/sentence-similarity -
/api/v1/task/sentence-similarity-tasks -
/api/v1/task/rerank -
/api/v1/task/rerank-tasks -
/info/models -
/info/version -
/info/runtime
gRPC endpoints
-
:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict -
:443 caikit.runtime.Nlp.NlpService/EmbeddingTasksPredict -
:443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTaskPredict -
:443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTasksPredict -
:443 caikit.runtime.Nlp.NlpService/RerankTaskPredict -
:443 caikit.runtime.Nlp.NlpService/RerankTasksPredict -
:443 caikit.runtime.info.InfoService/GetModelsInfo -
:443 caikit.runtime.info.InfoService/GetRuntimeInfo
By default, the Caikit Standalone Runtime exposes REST endpoints. To use gRPC protocol, manually deploy a custom Caikit Standalone ServingRuntime. For more information, see Adding a custom model-serving runtime for the single-model serving platform.
An example manifest is available in the caikit-tgis-serving GitHub repository.
Example command
REST
curl -H 'Content-Type: application/json' -d '{"inputs": "<text>", "model_id": "<model_id>"}' <inference_endpoint_url>/api/v1/task/embedding -H 'Authorization: Bearer <token>'
gRPC
grpcurl -d '{"text": "<text>"}' -H \"mm-model-id: <model_id>\" <inference_endpoint_url>:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict -H 'Authorization: Bearer <token>'
5.4.3. TGIS Standalone ServingRuntime for KServe Copy linkLink copied to clipboard!
The Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe is deprecated. For more information, see OpenShift AI release notes.
-
:443 fmaas.GenerationService/Generate :443 fmaas.GenerationService/GenerateStreamNoteTo query the endpoint for the TGIS standalone runtime, you must also download the files in the proto directory of the OpenShift AI
text-generation-inferencerepository.
Example command
grpcurl -proto text-generation-inference/proto/generation.proto -d '{"requests": [{"text":"<text>"}]}' -H 'Authorization: Bearer <token>' -insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate
5.4.4. OpenVINO Model Server Copy linkLink copied to clipboard!
-
/v2/models/<model-name>/infer
Example command
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'
5.4.5. vLLM NVIDIA GPU ServingRuntime for KServe Copy linkLink copied to clipboard!
-
:443/version -
:443/docs -
:443/v1/models -
:443/v1/chat/completions -
:443/v1/completions -
:443/v1/embeddings -
:443/tokenize :443/detokenizeNote- The vLLM runtime is compatible with the OpenAI REST API. For a list of models that the vLLM runtime supports, see Supported models.
- To use the embeddings inference endpoint in vLLM, you must use an embeddings model that the vLLM supports. You cannot use the embeddings endpoint with generative models. For more information, see Supported embeddings models in vLLM.
As of vLLM v0.5.5, you must provide a chat template while querying a model using the
/v1/chat/completionsendpoint. If your model does not include a predefined chat template, you can use thechat-templatecommand-line parameter to specify a chat template in your custom vLLM runtime, as shown in the example. Replace<CHAT_TEMPLATE>with the path to your template.containers: - args: - --chat-template=<CHAT_TEMPLATE>You can use the chat templates that are available as
.jinjafiles here or with the vLLM image under/app/data/template. For more information, see Chat templates.
As indicated by the paths shown, the single-model serving platform uses the HTTPS port of your OpenShift router (usually port 443) to serve external API requests.
Example command
curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -d '{ "messages": [{ "role": "<role>", "content": "<content>" }] -H 'Authorization: Bearer <token>'
5.4.6. vLLM Intel Gaudi Accelerator ServingRuntime for KServe Copy linkLink copied to clipboard!
5.4.7. vLLM AMD GPU ServingRuntime for KServe Copy linkLink copied to clipboard!
5.4.8. vLLM Spyre AI Accelerator ServingRuntime for KServe Copy linkLink copied to clipboard!
Support for IBM Spyre AI Accelerators on x86 is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
5.4.9. NVIDIA Triton Inference Server Copy linkLink copied to clipboard!
REST endpoints
-
v2/models/[/versions/<model_version>]/infer -
v2/models/<model_name>[/versions/<model_version>] -
v2/health/ready -
v2/health/live -
v2/models/<model_name>[/versions/]/ready -
v2
ModelMesh does not support the following REST endpoints:
-
v2/health/live -
v2/health/ready -
v2/models/<model_name>[/versions/]/ready
Example command
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'
gRPC endpoints
-
:443 inference.GRPCInferenceService/ModelInfer -
:443 inference.GRPCInferenceService/ModelReady -
:443 inference.GRPCInferenceService/ModelMetadata -
:443 inference.GRPCInferenceService/ServerReady -
:443 inference.GRPCInferenceService/ServerLive -
:443 inference.GRPCInferenceService/ServerMetadata
Example command
grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata
5.4.10. Seldon MLServer Copy linkLink copied to clipboard!
REST endpoints
-
v2/models/[/versions/<model_version>]/infer -
v2/models/<model_name>[/versions/<model_version>] -
v2/health/ready -
v2/health/live -
v2/models/<model_name>[/versions/]/ready -
v2
Example command
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'
gRPC endpoints
-
:443 inference.GRPCInferenceService/ModelInfer -
:443 inference.GRPCInferenceService/ModelReady -
:443 inference.GRPCInferenceService/ModelMetadata -
:443 inference.GRPCInferenceService/ServerReady -
:443 inference.GRPCInferenceService/ServerLive -
:443 inference.GRPCInferenceService/ServerMetadata
Example command
grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata