Chapter 4. Making inference requests to deployed models
When you deploy a model, it is available as a service that you can access with API requests. This allows you to get predictions from your model based on the data you provide in the request.
4.1. Accessing the authentication token for a deployed model Copy linkLink copied to clipboard!
If you secured your model inference endpoint by enabling token authentication, you must know how to access your authentication token so that you can specify it in your inference requests.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- You have deployed a model by using the model serving platform.
Procedure
From the OpenShift AI dashboard, click Projects.
The Projects page opens.
Click the name of the project that contains your deployed model.
A project details page opens.
- Click the Deployments tab.
In the Deployments list, expand the section for your model.
Your authentication token is shown in the Token authentication section, in the Token secret field.
-
Optional: To copy the authentication token for use in an inference request, click the Copy button (
) next to the token value.
4.2. Accessing the inference endpoint for a deployed model Copy linkLink copied to clipboard!
To make inference requests to your deployed model, you must know how to access the inference endpoint that is available.
For a list of paths to use with the supported runtimes and example commands, see Inference endpoints.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- You have deployed a model by using the model serving platform.
- If you enabled token authentication for your deployed model, you have the associated token value.
Procedure
From the OpenShift AI dashboard, click AI hub
Deployments. The inference endpoint for the model is shown in the Inference endpoints field.
- Depending on what action you want to perform with the model (and if the model supports that action), copy the inference endpoint and then add a path to the end of the URL.
- Use the endpoint to make API requests to your deployed model.
4.3. Making inference requests to models deployed on the model serving platform Copy linkLink copied to clipboard!
When you deploy a model by using the model serving platform, the model is available as a service that you can access using API requests. This enables you to return predictions based on data inputs. To use API requests to interact with your deployed model, you must know the inference endpoint for the model.
In addition, if you secured your inference endpoint by enabling token authentication, you must know how to access your authentication token so that you can specify this in your inference requests.
4.4. Inference endpoints Copy linkLink copied to clipboard!
These examples show how to use inference endpoints to query the model.
If you enabled token authentication when deploying the model, add the Authorization header and specify a token value.
4.4.1. Caikit TGIS ServingRuntime for KServe Copy linkLink copied to clipboard!
-
:443/api/v1/task/text-generation -
:443/api/v1/task/server-streaming-text-generation
Example command
curl --json '{"model_id": "<model_name__>", "inputs": "<text>"}' https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation -H 'Authorization: Bearer <token>'
curl --json '{"model_id": "<model_name__>", "inputs": "<text>"}' https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation -H 'Authorization: Bearer <token>'
4.4.2. OpenVINO Model Server Copy linkLink copied to clipboard!
-
/v2/models/<model-name>/infer
Example command
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'
4.4.3. vLLM NVIDIA GPU ServingRuntime for KServe Copy linkLink copied to clipboard!
-
:443/version -
:443/docs -
:443/v1/models -
:443/v1/chat/completions -
:443/v1/completions -
:443/v1/embeddings -
:443/tokenize :443/detokenizeNote- The vLLM runtime is compatible with the OpenAI REST API.
- To use the embeddings inference endpoint in vLLM, you must use an embeddings model that the vLLM supports. You cannot use the embeddings endpoint with generative models. For more information, see Supported embeddings models in vLLM.
As of vLLM v0.5.5, you must provide a chat template while querying a model using the
/v1/chat/completionsendpoint. If your model does not include a predefined chat template, you can use thechat-templatecommand-line parameter to specify a chat template in your custom vLLM runtime, as shown in the example. Replace<CHAT_TEMPLATE>with the path to your template.containers: - args: - --chat-template=<CHAT_TEMPLATE>containers: - args: - --chat-template=<CHAT_TEMPLATE>Copy to Clipboard Copied! Toggle word wrap Toggle overflow You can use the chat templates that are available as
.jinjafiles here or with the vLLM image under/app/data/template. For more information, see Chat templates.
As indicated by the paths shown, the model serving platform uses the HTTPS port of your OpenShift router (usually port 443) to serve external API requests.
Example command
curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -d '{ "messages": [{ "role": "<role>", "content": "<content>" }] -H 'Authorization: Bearer <token>'
curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -d '{ "messages": [{ "role": "<role>", "content": "<content>" }] -H 'Authorization: Bearer <token>'
4.4.4. vLLM Intel Gaudi Accelerator ServingRuntime for KServe Copy linkLink copied to clipboard!
4.4.5. vLLM AMD GPU ServingRuntime for KServe Copy linkLink copied to clipboard!
4.4.6. vLLM Spyre AI Accelerator ServingRuntime for KServe Copy linkLink copied to clipboard!
Support for IBM Spyre AI Accelerators on x86 is currently available in Red Hat OpenShift AI 3.2 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
You can serve models with IBM Spyre AI accelerators on x86 by using the vLLM Spyre AI Accelerator ServingRuntime for KServe runtime. To use the runtime, you must install the Spyre Operator and configure a hardware profile. For more information, see Spyre operator image and Working with hardware profiles.
4.4.7. vLLM Spyre s390x ServingRuntime for KServe Copy linkLink copied to clipboard!
You can serve models with IBM Spyre AI accelerators on IBM Z (s390x architecture) by using the vLLM Spyre s390x ServingRuntime for KServe runtime. To use the runtime, you must install the Spyre Operator and configure a hardware profile. For more information, see Spyre operator image and Working with hardware profiles.
4.4.8. NVIDIA Triton Inference Server Copy linkLink copied to clipboard!
REST endpoints
-
v2/models/[/versions/<model_version>]/infer -
v2/models/<model_name>[/versions/<model_version>] -
v2/health/ready -
v2/health/live -
v2/models/<model_name>[/versions/]/ready -
v2
Example command
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'
gRPC endpoints
-
:443 inference.GRPCInferenceService/ModelInfer -
:443 inference.GRPCInferenceService/ModelReady -
:443 inference.GRPCInferenceService/ModelMetadata -
:443 inference.GRPCInferenceService/ServerReady -
:443 inference.GRPCInferenceService/ServerLive -
:443 inference.GRPCInferenceService/ServerMetadata
Example command
grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata
grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata
4.4.9. Seldon MLServer Copy linkLink copied to clipboard!
REST endpoints
-
v2/models/[/versions/<model_version>]/infer -
v2/models/<model_name>[/versions/<model_version>] -
v2/health/ready -
v2/health/live -
v2/models/<model_name>[/versions/]/ready -
v2
Example command
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'
gRPC endpoints
-
:443 inference.GRPCInferenceService/ModelInfer -
:443 inference.GRPCInferenceService/ModelReady -
:443 inference.GRPCInferenceService/ModelMetadata -
:443 inference.GRPCInferenceService/ServerReady -
:443 inference.GRPCInferenceService/ServerLive -
:443 inference.GRPCInferenceService/ServerMetadata
Example command
grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata
grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata