Chapter 3. Serving large models
For deploying large models such as large language models (LLMs), Red Hat OpenShift AI includes a single-model serving platform that is based on the KServe component. Because each model is deployed on its own model server, the single-model serving platform helps you to deploy, monitor, scale, and maintain large models that require increased resources.
3.1. About the single-model serving platform
For deploying large models such as large language models (LLMs), OpenShift AI includes a single-model serving platform that is based on the KServe component. Because each model is deployed on its own model server, the single-model serving platform helps you to deploy, monitor, scale, and maintain large models that require increased resources.
3.1.1. Components
- KServe: A Kubernetes custom resource definition (CRD) that orchestrates model serving for all types of models. KServe includes model-serving runtimes that implement the loading of given types of model servers. KServe also handles the lifecycle of the deployment object, storage access, and networking setup.
- Red Hat OpenShift Serverless: A cloud-native development model that allows for serverless deployments of models. OpenShift Serverless is based on the open source Knative project.
- Red Hat OpenShift Service Mesh: A service mesh networking layer that manages traffic flows and enforces access policies. OpenShift Service Mesh is based on the open source Istio project.
3.1.2. Installation options
To install the single-model serving platform, you have the following options:
- Automated installation
If you have not already created a ServiceMeshControlPlane or KNativeServing resource on your OpenShift cluster, you can configure the Red Hat OpenShift AI Operator to install KServe and configure its dependencies. For more information about automated installation, see Configuring automated installation of KServe.
- Manual installation
If you have already created a ServiceMeshControlPlane or KNativeServing resource on your OpenShift cluster, you cannot configure the Red Hat OpenShift AI Operator to install KServe and configure its dependencies. In this situation, you must install KServe manually. For more information about manual installation, see Manually installing KServe.
3.1.3. Model-serving runtimes
When you have installed KServe, you can use the OpenShift AI dashboard to deploy models using pre-installed or custom model-serving runtimes.
OpenShift AI includes the following pre-installed runtimes for KServe:
- TGIS Standalone ServingRuntime for KServe: A runtime for serving TGI-enabled models
- Caikit-TGIS ServingRuntime for KServe: A composite runtime for serving models in the Caikit format
- Caikit Standalone ServingRuntime for KServe: A runtime for serving models in the Caikit embeddings format for embeddings tasks
- OpenVINO Model Server: A scalable, high-performance runtime for serving models that are optimized for Intel architectures
- vLLM ServingRuntime for KServe: A high-throughput and memory-efficient inference and serving runtime for large language models
- Text Generation Inference Server (TGIS) is based on an early fork of Hugging Face TGI. Red Hat will continue to develop the standalone TGIS runtime to support TGI models. If a model does not work in the current version of OpenShift AI, support might be added in a future version. In the meantime, you can also add your own custom runtime to support a TGI model. For more information, see Adding a custom model-serving runtime for the single-model serving platform.
- The composite Caikit-TGIS runtime is based on Caikit and Text Generation Inference Server (TGIS). To use this runtime, you must convert your models to Caikit format. For an example, see Converting Hugging Face Hub models to Caikit format in the caikit-tgis-serving repository.
- The Caikit Standalone runtime is based on Caikit NLP. To use this runtime, you must convert your models to the Caikit embeddings format. For an example, see Bootstrap Model.
3.1.4. Authorization
You can add Authorino as an authorization provider for the single-model serving platform. Adding an authorization provider allows you to enable token authorization for models that you deploy on the platform, which ensures that only authorized parties can make inference requests to the models.
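For example, once token authorization is enabled, a client must present a bearer token with every inference request. The following is a minimal sketch, assuming a deployed model with a REST inference endpoint and a token value obtained from the OpenShift AI dashboard; the URL, path, and request body are placeholders:
$ curl -ks https://<inference_endpoint_url>/<path> -H 'Authorization: Bearer <token>' -H 'Content-Type: application/json' -d '<request_body>'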
To add Authorino as an authorization provider on the single-model serving platform, you have the following options:
- If automated installation of the single-model serving platform is possible on your cluster, you can include Authorino as part of the automated installation process.
- If you need to manually install the single-model serving platform, you must also manually configure Authorino.
For guidance on choosing an installation option for the single-model serving platform, see Installation options.
3.1.5. Monitoring
You can configure monitoring for the single-model serving platform and use Prometheus to scrape metrics for each of the pre-installed model-serving runtimes.
3.2. Configuring automated installation of KServe
If you have not already created a ServiceMeshControlPlane or KNativeServing resource on your OpenShift cluster, you can configure the Red Hat OpenShift AI Operator to install KServe and configure its dependencies.
If you have created a ServiceMeshControlPlane or KNativeServing resource on your cluster, the Red Hat OpenShift AI Operator cannot install KServe and configure its dependencies, and the installation does not proceed. In this situation, you must follow the manual installation instructions to install KServe.
Prerequisites
- You have cluster administrator privileges for your OpenShift cluster.
- Your cluster has a node with 4 CPUs and 16 GB memory.
- You have downloaded and installed the OpenShift command-line interface (CLI). For more information, see Installing the OpenShift CLI.
- You have installed the Red Hat OpenShift Service Mesh Operator and dependent Operators.
Note: To enable automated installation of KServe, install only the required Operators for Red Hat OpenShift Service Mesh. Do not perform any additional configuration or create a ServiceMeshControlPlane resource.
- You have installed the Red Hat OpenShift Serverless Operator.
Note: To enable automated installation of KServe, install only the Red Hat OpenShift Serverless Operator. Do not perform any additional configuration or create a KNativeServing resource.
- You have installed the Red Hat OpenShift AI Operator and created a DataScienceCluster object.
- To add Authorino as an authorization provider so that you can enable token authorization for deployed models, you have installed the Red Hat - Authorino Operator. See Installing the Authorino Operator.
Procedure
- Log in to the OpenShift web console as a cluster administrator.
- In the web console, click Operators → Installed Operators and then click the Red Hat OpenShift AI Operator.
- Install OpenShift Service Mesh as follows:
- Click the DSC Initialization tab.
- Click the default-dsci object.
- Click the YAML tab.
In the spec section, validate that the value of the managementState field for the serviceMesh component is set to Managed, as shown:
spec:
  applicationsNamespace: redhat-ods-applications
  monitoring:
    managementState: Managed
    namespace: redhat-ods-monitoring
  serviceMesh:
    controlPlane:
      metricsCollection: Istio
      name: data-science-smcp
      namespace: istio-system
    managementState: Managed
Note: Do not change the istio-system namespace that is specified for the serviceMesh component by default. Other namespace values are not supported.
Click Save.
Based on the configuration you added to the DSCInitialization object, the Red Hat OpenShift AI Operator installs OpenShift Service Mesh.
Install both KServe and OpenShift Serverless as follows:
- In the web console, click Operators → Installed Operators and then click the Red Hat OpenShift AI Operator.
- Click the Data Science Cluster tab.
- Click the default-dsc DSC object.
- Click the YAML tab.
In the spec.components section, configure the kserve component as shown:
spec:
  components:
    kserve:
      managementState: Managed
      serving:
        ingressGateway:
          certificate:
            secretName: knative-serving-cert
            type: SelfSigned
        managementState: Managed
        name: knative-serving
Click Save.
The preceding configuration creates an ingress gateway for OpenShift Serverless to receive traffic from OpenShift Service Mesh. In this configuration, observe the following details:
- The configuration shown generates a self-signed certificate to secure incoming traffic to your OpenShift cluster and stores the certificate in the knative-serving-cert secret that is specified in the secretName field. To provide your own certificate, update the value of the secretName field to specify your secret name and change the value of the type field to Provided (see the example after this list).
Note: If you provide your own certificate, the certificate must specify the domain name used by the ingress controller of your OpenShift cluster. You can check this value by running the following command:
$ oc get ingresses.config.openshift.io cluster -o jsonpath='{.spec.domain}'
- You must set the value of the managementState field to Managed for both the kserve and serving components. Setting kserve.managementState to Managed triggers automated installation of KServe. Setting serving.managementState to Managed triggers automated installation of OpenShift Serverless. However, installation of OpenShift Serverless will not be triggered if kserve.managementState is not also set to Managed.
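For example, to supply your own certificate instead of the generated self-signed one, you might first create a TLS secret from your certificate and key and then reference that secret with type Provided. This is a hedged sketch, not part of the documented procedure: the secret name and file paths are placeholders, and the istio-system namespace is an assumption, so confirm the expected namespace for your OpenShift AI version.
$ oc create secret tls <your_cert_secret> --cert=<path_to_certificate> --key=<path_to_key> -n istio-system
spec:
  components:
    kserve:
      managementState: Managed
      serving:
        ingressGateway:
          certificate:
            secretName: <your_cert_secret>
            type: Provided
        managementState: Managed
        name: knative-serving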
Verification
Verify installation of OpenShift Service Mesh as follows:
- In the web console, click Workloads → Pods.
- From the project list, select istio-system. This is the project in which OpenShift Service Mesh is installed.
Confirm that there are running pods for the service mesh control plane, ingress gateway, and egress gateway. These pods have the naming patterns shown in the following example:
NAME                                       READY   STATUS    RESTARTS   AGE
istio-egressgateway-7c46668687-fzsqj       1/1     Running   0          22h
istio-ingressgateway-77f94d8f85-fhsp9      1/1     Running   0          22h
istiod-data-science-smcp-cc8cfd9b8-2rkg4   1/1     Running   0          22h
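Optionally, you can perform an equivalent check from the OpenShift CLI. This sketch assumes that you are logged in as a cluster administrator; the READY column of the first command indicates whether the control plane components are available:
$ oc get smcp -n istio-system
$ oc get pods -n istio-system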
Verify installation of OpenShift Serverless as follows:
- In the web console, click Workloads → Pods.
- From the project list, select knative-serving. This is the project in which OpenShift Serverless is installed.
Confirm that there are numerous running pods in the knative-serving project, including activator, autoscaler, controller, and domain mapping pods, as well as pods for the Knative Istio controller (which controls the integration of OpenShift Serverless and OpenShift Service Mesh). An example is shown:
NAME                                       READY   STATUS    RESTARTS   AGE
activator-7586f6f744-nvdlb                 2/2     Running   0          22h
activator-7586f6f744-sd77w                 2/2     Running   0          22h
autoscaler-764fdf5d45-p2v98                2/2     Running   0          22h
autoscaler-764fdf5d45-x7dc6                2/2     Running   0          22h
autoscaler-hpa-7c7c4cd96d-2lkzg            1/1     Running   0          22h
autoscaler-hpa-7c7c4cd96d-gks9j            1/1     Running   0          22h
controller-5fdfc9567c-6cj9d                1/1     Running   0          22h
controller-5fdfc9567c-bf5x7                1/1     Running   0          22h
domain-mapping-56ccd85968-2hjvp            1/1     Running   0          22h
domain-mapping-56ccd85968-lg6mw            1/1     Running   0          22h
domainmapping-webhook-769b88695c-gp2hk     1/1     Running   0          22h
domainmapping-webhook-769b88695c-npn8g     1/1     Running   0          22h
net-istio-controller-7dfc6f668c-jb4xk      1/1     Running   0          22h
net-istio-controller-7dfc6f668c-jxs5p      1/1     Running   0          22h
net-istio-webhook-66d8f75d6f-bgd5r         1/1     Running   0          22h
net-istio-webhook-66d8f75d6f-hld75         1/1     Running   0          22h
webhook-7d49878bc4-8xjbr                   1/1     Running   0          22h
webhook-7d49878bc4-s4xx4                   1/1     Running   0          22h
Verify installation of KServe as follows:
- In the web console, click Workloads → Pods.
- From the project list, select redhat-ods-applications. This is the project in which OpenShift AI components are installed, including KServe.
Confirm that the project includes a running pod for the KServe controller manager, similar to the following example:
NAME                                         READY   STATUS    RESTARTS   AGE
kserve-controller-manager-7fbb7bccd4-t4c5g   1/1     Running   0          22h
odh-model-controller-6c4759cc9b-cftmk        1/1     Running   0          129m
odh-model-controller-6c4759cc9b-ngj8b        1/1     Running   0          129m
odh-model-controller-6c4759cc9b-vnhq5        1/1     Running   0          129m
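Optionally, you can also confirm from the OpenShift CLI that the KServe custom resource definitions were registered on the cluster. This additional check is not part of the documented procedure:
$ oc get crd inferenceservices.serving.kserve.io servingruntimes.serving.kserve.io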
3.3. Manually installing KServe
If you have already installed the Red Hat OpenShift Service Mesh Operator and created a ServiceMeshControlPlane resource, or if you have installed the Red Hat OpenShift Serverless Operator and created a KNativeServing resource, the Red Hat OpenShift AI Operator cannot install KServe and configure its dependencies. In this situation, you must install KServe manually.
The procedures in this section show how to perform a new installation of KServe and its dependencies and are intended as a complete installation and configuration reference. If you have already installed and configured OpenShift Service Mesh or OpenShift Serverless, you might not need to follow all steps. If you are unsure about what updates to apply to your existing configuration to use KServe, contact Red Hat Support.
3.3.1. Installing KServe dependencies
Before you install KServe, you must install and configure some dependencies. Specifically, you must create Red Hat OpenShift Service Mesh and Knative Serving instances and then configure secure gateways for Knative Serving.
3.3.1.1. Creating an OpenShift Service Mesh instance
The following procedure shows how to create a Red Hat OpenShift Service Mesh instance.
Prerequisites
- You have cluster administrator privileges for your OpenShift cluster.
- Your cluster has a node with 4 CPUs and 16 GB memory.
- You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.
- You have installed the Red Hat OpenShift Service Mesh Operator and dependent Operators.
Procedure
In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:
$ oc login <openshift_cluster_url> -u <admin_username> -p <password>
Create the required namespace for Red Hat OpenShift Service Mesh.
$ oc create ns istio-system
You see the following output:
namespace/istio-system created
Define a ServiceMeshControlPlane object in a YAML file named smcp.yaml with the following contents:
apiVersion: maistra.io/v2
kind: ServiceMeshControlPlane
metadata:
  name: minimal
  namespace: istio-system
spec:
  tracing:
    type: None
  addons:
    grafana:
      enabled: false
    kiali:
      name: kiali
      enabled: false
    prometheus:
      enabled: false
    jaeger:
      name: jaeger
  security:
    dataPlane:
      mtls: true
    identity:
      type: ThirdParty
  techPreview:
    meshConfig:
      defaultConfig:
        terminationDrainDuration: 35s
  gateways:
    ingress:
      service:
        metadata:
          labels:
            knative: ingressgateway
  proxy:
    networking:
      trafficControl:
        inbound:
          excludedPorts:
            - 8444
            - 8022
For more information about the values in the YAML file, see the Service Mesh control plane configuration reference.
Create the service mesh control plane.
$ oc apply -f smcp.yaml
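Optionally, you can wait for the control plane to become ready before you continue. The following sketch assumes that your Service Mesh Operator version reports a Ready condition on the ServiceMeshControlPlane resource and uses the minimal instance name from the preceding step:
$ oc wait --for=condition=Ready smcp/minimal -n istio-system --timeout=300s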
Verification
Verify creation of the service mesh instance as follows:
In the OpenShift CLI, enter the following command:
$ oc get pods -n istio-system
The preceding command lists all running pods in the istio-system project. This is the project in which OpenShift Service Mesh is installed.
Confirm that there are running pods for the service mesh control plane, ingress gateway, and egress gateway. These pods have the following naming patterns:
NAME                                       READY   STATUS    RESTARTS   AGE
istio-egressgateway-7c46668687-fzsqj       1/1     Running   0          22h
istio-ingressgateway-77f94d8f85-fhsp9      1/1     Running   0          22h
istiod-data-science-smcp-cc8cfd9b8-2rkg4   1/1     Running   0          22h
3.3.1.2. Creating a Knative Serving instance
The following procedure shows how to install Knative Serving and then create an instance.
Prerequisites
- You have cluster administrator privileges for your OpenShift cluster.
- Your cluster has a node with 4 CPUs and 16 GB memory.
- You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.
- You have created a Red Hat OpenShift Service Mesh instance.
- You have installed the Red Hat OpenShift Serverless Operator.
Procedure
In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:
$ oc login <openshift_cluster_url> -u <admin_username> -p <password>
Check whether the required project (that is, namespace) for Knative Serving already exists.
$ oc get ns knative-serving
If the project exists, you see output similar to the following example:
NAME              STATUS   AGE
knative-serving   Active   4d20h
If the knative-serving project doesn't already exist, create it.
$ oc create ns knative-serving
You see the following output:
namespace/knative-serving created
Define a ServiceMeshMember object in a YAML file called default-smm.yaml with the following contents:
apiVersion: maistra.io/v1
kind: ServiceMeshMember
metadata:
  name: default
  namespace: knative-serving
spec:
  controlPlaneRef:
    namespace: istio-system
    name: minimal
Create the ServiceMeshMember object in the specified knative-serving namespace.
$ oc apply -f default-smm.yaml
You see the following output:
servicemeshmember.maistra.io/default created
Define a KnativeServing object in a YAML file called knativeserving-istio.yaml with the following contents:
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
  annotations:
    serverless.openshift.io/default-enable-http2: "true"
spec:
  workloads:
    - name: net-istio-controller
      env:
        - container: controller
          envVars:
            - name: ENABLE_SECRET_INFORMER_FILTERING_BY_CERT_UID
              value: 'true'
    - annotations:
        sidecar.istio.io/inject: "true" 1
        sidecar.istio.io/rewriteAppHTTPProbers: "true" 2
      name: activator
    - annotations:
        sidecar.istio.io/inject: "true"
        sidecar.istio.io/rewriteAppHTTPProbers: "true"
      name: autoscaler
  ingress:
    istio:
      enabled: true
  config:
    features:
      kubernetes.podspec-affinity: enabled
      kubernetes.podspec-nodeselector: enabled
      kubernetes.podspec-tolerations: enabled
The preceding file defines a custom resource (CR) for a KnativeServing object. The CR also adds the following actions to each of the activator and autoscaler pods:
- 1: Injects an Istio sidecar into the pod, which makes the pod part of the service mesh.
- 2: Enables the Istio sidecar to rewrite the HTTP liveness and readiness probes for the pod.
Note: If you configure a custom domain for a Knative service, you can use a TLS certificate to secure the mapped service. To do this, you must create a TLS secret, and then update the DomainMapping CR to use the TLS secret that you have created. For more information, see Securing a mapped service using a TLS certificate in the Red Hat OpenShift Serverless documentation.
Create the KnativeServing object in the specified knative-serving namespace.
$ oc apply -f knativeserving-istio.yaml
You see the following output:
knativeserving.operator.knative.dev/knative-serving created
Verification
Review the default ServiceMeshMemberRoll object in the istio-system namespace.
$ oc describe smmr default -n istio-system
In the description of the ServiceMeshMemberRoll object, locate the Status.Members field and confirm that it includes the knative-serving namespace.
Verify creation of the Knative Serving instance as follows:
In the OpenShift CLI, enter the following command:
$ oc get pods -n knative-serving
The preceding command lists all running pods in the knative-serving project. This is the project in which you created the Knative Serving instance.
Confirm that there are numerous running pods in the knative-serving project, including activator, autoscaler, controller, and domain mapping pods, as well as pods for the Knative Istio controller, which controls the integration of OpenShift Serverless and OpenShift Service Mesh. An example is shown:
NAME                                       READY   STATUS    RESTARTS   AGE
activator-7586f6f744-nvdlb                 2/2     Running   0          22h
activator-7586f6f744-sd77w                 2/2     Running   0          22h
autoscaler-764fdf5d45-p2v98                2/2     Running   0          22h
autoscaler-764fdf5d45-x7dc6                2/2     Running   0          22h
autoscaler-hpa-7c7c4cd96d-2lkzg            1/1     Running   0          22h
autoscaler-hpa-7c7c4cd96d-gks9j            1/1     Running   0          22h
controller-5fdfc9567c-6cj9d                1/1     Running   0          22h
controller-5fdfc9567c-bf5x7                1/1     Running   0          22h
domain-mapping-56ccd85968-2hjvp            1/1     Running   0          22h
domain-mapping-56ccd85968-lg6mw            1/1     Running   0          22h
domainmapping-webhook-769b88695c-gp2hk     1/1     Running   0          22h
domainmapping-webhook-769b88695c-npn8g     1/1     Running   0          22h
net-istio-controller-7dfc6f668c-jb4xk      1/1     Running   0          22h
net-istio-controller-7dfc6f668c-jxs5p      1/1     Running   0          22h
net-istio-webhook-66d8f75d6f-bgd5r         1/1     Running   0          22h
net-istio-webhook-66d8f75d6f-hld75         1/1     Running   0          22h
webhook-7d49878bc4-8xjbr                   1/1     Running   0          22h
webhook-7d49878bc4-s4xx4                   1/1     Running   0          22h
3.3.1.3. Creating secure gateways for Knative Serving
To secure traffic between your Knative Serving instance and the service mesh, you must create secure gateways for your Knative Serving instance.
The following procedure shows how to use OpenSSL to generate a wildcard certificate and key and then use them to create local and ingress gateways for Knative Serving.
If you have your own wildcard certificate and key to specify when configuring the gateways, you can skip ahead to the step in this procedure where you export your own wildcard key and certificate to environment variables.
Prerequisites
- You have cluster administrator privileges for your OpenShift cluster.
- You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.
- You have created a Red Hat OpenShift Service Mesh instance.
- You have created a Knative Serving instance.
- If you intend to generate a wildcard certificate and key, you have downloaded and installed OpenSSL.
Procedure
In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:
$ oc login <openshift_cluster_url> -u <admin_username> -p <password>
Important: If you have your own wildcard certificate and key to specify when configuring the gateways, skip ahead to the step in this procedure where you export your own wildcard key and certificate to environment variables.
Set environment variables to define base directories for generation of a wildcard certificate and key for the gateways.
$ export BASE_DIR=/tmp/kserve
$ export BASE_CERT_DIR=${BASE_DIR}/certs
Set an environment variable to define the common name used by the ingress controller of your OpenShift cluster.
$ export COMMON_NAME=$(oc get ingresses.config.openshift.io cluster -o jsonpath='{.spec.domain}' | awk -F'.' '{print $(NF-1)"."$NF}')
Set an environment variable to define the domain name used by the ingress controller of your OpenShift cluster.
$ export DOMAIN_NAME=$(oc get ingresses.config.openshift.io cluster -o jsonpath='{.spec.domain}')
Create the required base directories for the certificate generation, based on the environment variables that you previously set.
$ mkdir ${BASE_DIR}
$ mkdir ${BASE_CERT_DIR}
Create the OpenSSL configuration for generation of a wildcard certificate.
$ cat <<EOF> ${BASE_DIR}/openssl-san.config
[ req ]
distinguished_name = req
[ san ]
subjectAltName = DNS:*.${DOMAIN_NAME}
EOF
Generate a root certificate.
$ openssl req -x509 -sha256 -nodes -days 3650 -newkey rsa:2048 \
 -subj "/O=Example Inc./CN=${COMMON_NAME}" \
 -keyout $BASE_DIR/root.key \
 -out $BASE_DIR/root.crt
Generate a wildcard certificate signed by the root certificate.
$ openssl req -x509 -newkey rsa:2048 \
 -sha256 -days 3560 -nodes \
 -subj "/CN=${COMMON_NAME}/O=Example Inc." \
 -extensions san -config ${BASE_DIR}/openssl-san.config \
 -CA $BASE_DIR/root.crt \
 -CAkey $BASE_DIR/root.key \
 -keyout $BASE_DIR/wildcard.key \
 -out $BASE_DIR/wildcard.crt
$ openssl x509 -in ${BASE_DIR}/wildcard.crt -text
Verify the wildcard certificate.
$ openssl verify -CAfile ${BASE_DIR}/root.crt ${BASE_DIR}/wildcard.crt
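If verification succeeds, OpenSSL prints the certificate path followed by OK, similar to the following output:
/tmp/kserve/wildcard.crt: OK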
Export the wildcard key and certificate that were created by the preceding commands to new environment variables.
$ export TARGET_CUSTOM_CERT=${BASE_DIR}/wildcard.crt
$ export TARGET_CUSTOM_KEY=${BASE_DIR}/wildcard.key
Optional: To export your own wildcard key and certificate to new environment variables, enter the following commands:
$ export TARGET_CUSTOM_CERT=<path_to_certificate>
$ export TARGET_CUSTOM_KEY=<path_to_key>
Note: In the certificate that you provide, you must specify the domain name used by the ingress controller of your OpenShift cluster. You can check this value by running the following command:
$ oc get ingresses.config.openshift.io cluster -o jsonpath='{.spec.domain}'
Create a TLS secret in the istio-system namespace using the environment variables that you set for the wildcard certificate and key.
$ oc create secret tls wildcard-certs --cert=${TARGET_CUSTOM_CERT} --key=${TARGET_CUSTOM_KEY} -n istio-system
Create a gateways.yaml YAML file with the following contents:
apiVersion: v1
kind: Service 1
metadata:
  labels:
    experimental.istio.io/disable-gateway-port-translation: "true"
  name: knative-local-gateway
  namespace: istio-system
spec:
  ports:
    - name: http2
      port: 80
      protocol: TCP
      targetPort: 8081
  selector:
    knative: ingressgateway
  type: ClusterIP
---
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: knative-ingress-gateway 2
  namespace: knative-serving
spec:
  selector:
    knative: ingressgateway
  servers:
    - hosts:
        - '*'
      port:
        name: https
        number: 443
        protocol: HTTPS
      tls:
        credentialName: wildcard-certs
        mode: SIMPLE
---
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: knative-local-gateway 3
  namespace: knative-serving
spec:
  selector:
    knative: ingressgateway
  servers:
    - port:
        number: 8081
        name: https
        protocol: HTTPS
      tls:
        mode: ISTIO_MUTUAL
      hosts:
        - "*"
- 1: Defines a service in the istio-system namespace for the Knative local gateway.
- 2: Defines an ingress gateway in the knative-serving namespace. The gateway uses the TLS secret you created earlier in this procedure. The ingress gateway handles external traffic to Knative.
- 3: Defines a local gateway for Knative in the knative-serving namespace.
Apply the gateways.yaml file to create the defined resources.
$ oc apply -f gateways.yaml
You see the following output:
service/knative-local-gateway created
gateway.networking.istio.io/knative-ingress-gateway created
gateway.networking.istio.io/knative-local-gateway created
Verification
Review the gateways that you created.
$ oc get gateway --all-namespaces
Confirm that you see the local and ingress gateways that you created in the knative-serving namespace, as shown in the following example:
NAMESPACE         NAME                      AGE
knative-serving   knative-ingress-gateway   69s
knative-serving   knative-local-gateway     2m
3.3.2. Installing KServe
To complete manual installation of KServe, you must install the Red Hat OpenShift AI Operator. Then, you can configure the Operator to install KServe.
Prerequisites
- You have cluster administrator privileges for your OpenShift cluster.
- Your cluster has a node with 4 CPUs and 16 GB memory.
- You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.
- You have created a Red Hat OpenShift Service Mesh instance.
- You have created a Knative Serving instance.
- You have created secure gateways for Knative Serving.
- You have installed the Red Hat OpenShift AI Operator and created a DataScienceCluster object.
Procedure
- Log in to the OpenShift web console as a cluster administrator.
- In the web console, click Operators → Installed Operators and then click the Red Hat OpenShift AI Operator.
- For installation of KServe, configure the OpenShift Service Mesh component as follows:
- Click the DSC Initialization tab.
- Click the default-dsci object.
- Click the YAML tab.
In the spec section, add and configure the serviceMesh component as shown:
spec:
  serviceMesh:
    managementState: Unmanaged
- Click Save.
For installation of KServe, configure the KServe and OpenShift Serverless components as follows:
- In the web console, click Operators → Installed Operators and then click the Red Hat OpenShift AI Operator.
- Click the Data Science Cluster tab.
- Click the default-dsc DSC object.
- Click the YAML tab.
In the spec.components section, configure the kserve component as shown:
spec:
  components:
    kserve:
      managementState: Managed
Within the kserve component, add the serving component and configure it as shown:
spec:
  components:
    kserve:
      managementState: Managed
      serving:
        managementState: Unmanaged
- Click Save.
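Although this procedure does not define a separate verification step, you can optionally confirm that KServe was installed by checking for the KServe controller pod in the OpenShift AI applications project, as you would after an automated installation. The following sketch assumes the default redhat-ods-applications namespace:
$ oc get pods -n redhat-ods-applications | grep kserve-controller-manager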
3.3.3. Manually adding an authorization provider
You can add Authorino as an authorization provider for the single-model serving platform. Adding an authorization provider allows you to enable token authorization for models that you deploy on the platform, which ensures that only authorized parties can make inference requests to the models.
To manually add Authorino as an authorization provider, you must install the Red Hat - Authorino Operator, create an Authorino instance, and then configure the OpenShift Service Mesh and KServe components to use the instance.
To manually add an authorization provider, you must make configuration updates to your OpenShift Service Mesh instance. To ensure that your OpenShift Service Mesh instance remains in a supported state, make only the updates shown in this section.
Prerequisites
- You have reviewed the options for adding Authorino as an authorization provider and identified manual installation as the appropriate option. See Adding an authorization provider.
- You have manually installed KServe and its dependencies, including OpenShift Service Mesh. See Manually installing KServe.
- When you manually installed KServe, you set the value of the managementState field for the serviceMesh component to Unmanaged. This setting is required for manually adding Authorino. See Installing KServe.
3.3.3.1. Installing the Red Hat Authorino Operator
Before you can add Authorino as an authorization provider, you must install the Red Hat - Authorino Operator on your OpenShift cluster.
Prerequisites
- You have cluster administrator privileges for your OpenShift cluster.
Procedure
- Log in to the OpenShift web console as a cluster administrator.
- In the web console, click Operators → OperatorHub.
- On the OperatorHub page, in the Filter by keyword field, type Red Hat - Authorino.
- Click the Red Hat - Authorino Operator.
- On the Red Hat - Authorino Operator page, review the Operator information and then click Install.
- On the Install Operator page, keep the default values for Update channel, Version, Installation mode, Installed Namespace and Update Approval.
- Click Install.
Verification
In the OpenShift web console, click Operators → Installed Operators and confirm that the Red Hat - Authorino Operator shows one of the following statuses:
- Installing: installation is in progress; wait for this to change to Succeeded. This might take several minutes.
- Succeeded: installation is successful.
3.3.3.2. Creating an Authorino instance
When you have installed the Red Hat - Authorino Operator on your OpenShift cluster, you must create an Authorino instance.
Prerequisites
- You have installed the Red Hat - Authorino Operator.
- You have privileges to add resources to the project in which your OpenShift Service Mesh instance was created. See Creating an OpenShift Service Mesh instance.
For more information about OpenShift permissions, see Using RBAC to define and apply permissions.
Procedure
- Open a new terminal window.
Log in to the OpenShift command-line interface (CLI) as follows:
$ oc login <openshift_cluster_url> -u <username> -p <password>
Create a namespace to install the Authorino instance.
$ oc new-project <namespace_for_authorino_instance>
Note: The automated installation process creates a namespace called redhat-ods-applications-auth-provider for the Authorino instance. Consider using the same namespace name for the manual installation.
To enroll the new namespace for the Authorino instance in your existing OpenShift Service Mesh instance, create a new YAML file with the following contents:
apiVersion: maistra.io/v1
kind: ServiceMeshMember
metadata:
  name: default
  namespace: <namespace_for_authorino_instance>
spec:
  controlPlaneRef:
    namespace: <namespace_for_service_mesh_instance>
    name: <name_of_service_mesh_instance>
- Save the YAML file.
Create the ServiceMeshMember resource on your cluster.
$ oc create -f <file_name>.yaml
To configure an Authorino instance, create a new YAML file as shown in the following example:
apiVersion: operator.authorino.kuadrant.io/v1beta1
kind: Authorino
metadata:
  name: authorino
  namespace: <namespace_for_authorino_instance>
spec:
  authConfigLabelSelectors: security.opendatahub.io/authorization-group=default
  clusterWide: true
  listener:
    tls:
      enabled: false
  oidcServer:
    tls:
      enabled: false
- Save the YAML file.
Create the Authorino resource on your cluster.
$ oc create -f <file_name>.yaml
Patch the Authorino deployment to inject an Istio sidecar, which makes the Authorino instance part of your OpenShift Service Mesh instance.
$ oc patch deployment <name_of_authorino_instance> -n <namespace_for_authorino_instance> -p '{"spec": {"template":{"metadata":{"labels":{"sidecar.istio.io/inject":"true"}}}} }'
Verification
Confirm that the Authorino instance is running as follows:
Check the pods (and containers) that are running in the namespace that you created for the Authorino instance, as shown in the following example:
$ oc get pods -n redhat-ods-applications-auth-provider -o="custom-columns=NAME:.metadata.name,STATUS:.status.phase,CONTAINERS:.spec.containers[*].name"
Confirm that the output resembles the following example:
NAME                         STATUS    CONTAINERS
authorino-6bc64bd667-kn28z   Running   authorino,istio-proxy
As shown in the example, there is a single running pod for the Authorino instance. The pod has containers for Authorino and for the Istio sidecar that you injected.
3.3.3.3. Configuring an OpenShift Service Mesh instance to use Authorino
When you have created an Authorino instance, you must configure your OpenShift Service Mesh instance to use Authorino as an authorization provider.
To ensure that your OpenShift Service Mesh instance remains in a supported state, make only the configuration updates shown in the following procedure.
Prerequisites
- You have created an Authorino instance and enrolled the namespace for the Authorino instance in your OpenShift Service Mesh instance.
- You have privileges to modify the OpenShift Service Mesh instance. See Creating an OpenShift Service Mesh instance.
Procedure
In a terminal window, if you are not already logged in to your OpenShift cluster as a user that has privileges to update the OpenShift Service Mesh instance, log in to the OpenShift CLI as shown in the following example:
$ oc login <openshift_cluster_url> -u <username> -p <password>
Create a new YAML file with the following contents:
spec:
  techPreview:
    meshConfig:
      extensionProviders:
        - name: redhat-ods-applications-auth-provider
          envoyExtAuthzGrpc:
            service: <name_of_authorino_instance>-authorino-authorization.<namespace_for_authorino_instance>.svc.cluster.local
            port: 50051
- Save the YAML file.
Use the oc patch command to apply the YAML file to your OpenShift Service Mesh instance.
$ oc patch smcp <name_of_service_mesh_instance> --type merge -n <namespace_for_service_mesh_instance> --patch-file <file_name>.yaml
Important: You can apply the configuration shown as a patch only if you have not already specified other extension providers in your OpenShift Service Mesh instance. If you have already specified other extension providers, you must manually edit your ServiceMeshControlPlane resource to add the configuration.
Verification
Verify that your Authorino instance has been added as an extension provider in your OpenShift Service Mesh configuration as follows:
Inspect the ConfigMap object for your OpenShift Service Mesh instance:
$ oc get configmap istio-<name_of_service_mesh_instance> -n <namespace_for_service_mesh_instance> --output=jsonpath={.data.mesh}
Confirm that you see output similar to the following example, which shows that the Authorino instance has been successfully added as an extension provider.
defaultConfig:
  discoveryAddress: istiod-data-science-smcp.istio-system.svc:15012
  proxyMetadata:
    ISTIO_META_DNS_AUTO_ALLOCATE: "true"
    ISTIO_META_DNS_CAPTURE: "true"
    PROXY_XDS_VIA_AGENT: "true"
  terminationDrainDuration: 35s
  tracing: {}
dnsRefreshRate: 300s
enablePrometheusMerge: true
extensionProviders:
- envoyExtAuthzGrpc:
    port: 50051
    service: authorino-authorino-authorization.opendatahub-auth-provider.svc.cluster.local
  name: opendatahub-auth-provider
ingressControllerMode: "OFF"
rootNamespace: istio-system
trustDomain: null
3.3.3.4. Configuring authorization for KServe
To configure the single-model serving platform to use Authorino, you must create a global AuthorizationPolicy resource that is applied to the KServe predictor pods that are created when you deploy a model. In addition, to account for the multiple network hops that occur when you make an inference request to a model, you must create an EnvoyFilter resource that continually resets the HTTP host header to the one initially included in the inference request.
Prerequisites
- You have created an Authorino instance and configured your OpenShift Service Mesh to use it.
- You have privileges to update the KServe deployment on your cluster.
- You have privileges to add resources to the project in which your OpenShift Service Mesh instance was created. See Creating an OpenShift Service Mesh instance.
Procedure
In a terminal window, if you are not already logged in to your OpenShift cluster as a user that has privileges to update the KServe deployment, log in to the OpenShift CLI as shown in the following example:
$ oc login <openshift_cluster_url> -u <username> -p <password>
Create a new YAML file with the following contents:
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: kserve-predictor
spec:
  action: CUSTOM
  provider:
    name: redhat-ods-applications-auth-provider 1
  rules:
    - to:
        - operation:
            notPaths:
              - /healthz
              - /debug/pprof/
              - /metrics
              - /wait-for-drain
  selector:
    matchLabels:
      component: predictor
- 1: The name that you specify must match the name of the extension provider that you added to your OpenShift Service Mesh instance.
- Save the YAML file.
Create the AuthorizationPolicy resource in the namespace for your OpenShift Service Mesh instance.
$ oc create -n <namespace_for_service_mesh_instance> -f <file_name>.yaml
Create another new YAML file with the following contents:
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: activator-host-header
spec:
  priority: 20
  workloadSelector:
    labels:
      component: predictor
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        listener:
          filterChain:
            filter:
              name: envoy.filters.network.http_connection_manager
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.lua
          typed_config:
            '@type': type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
            inlineCode: |
              function envoy_on_request(request_handle)
                local headers = request_handle:headers()
                if not headers then
                  return
                end
                local original_host = headers:get("k-original-host")
                if original_host then
                  port_seperator = string.find(original_host, ":", 7)
                  if port_seperator then
                    original_host = string.sub(original_host, 0, port_seperator-1)
                  end
                  headers:replace('host', original_host)
                end
              end
The EnvoyFilter resource shown continually resets the HTTP host header to the one initially included in any inference request.
Create the EnvoyFilter resource in the namespace for your OpenShift Service Mesh instance.
$ oc create -n <namespace_for_service_mesh_instance> -f <file_name>.yaml
Verification
Check that the AuthorizationPolicy resource was successfully created.
$ oc get authorizationpolicies -n <namespace_for_service_mesh_instance>
Confirm that you see output similar to the following example:
NAME               AGE
kserve-predictor   28h
Check that the EnvoyFilter resource was successfully created.
$ oc get envoyfilter -n <namespace_for_service_mesh_instance>
Confirm that you see output similar to the following example:
NAME                    AGE
activator-host-header   28h
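Optionally, after you deploy a model, you can confirm that token authorization is enforced by sending an inference request without an Authorization header. This sketch assumes a model with a REST inference endpoint; a request without a valid token is expected to be rejected, typically with an HTTP 401 or 403 status code:
$ curl -ks -o /dev/null -w '%{http_code}\n' https://<inference_endpoint_url>/<path> -d '<request_body>'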
3.4. Adding an authorization provider for the single-model serving platform
You can add Authorino as an authorization provider for the single-model serving platform. Adding an authorization provider allows you to enable token authorization for models that you deploy on the platform, which ensures that only authorized parties can make inference requests to the models.
The method that you use to add Authorino as an authorization provider depends on how you install the single-model serving platform. The installation options for the platform are described as follows:
- Automated installation
If you have not already created a ServiceMeshControlPlane or KNativeServing resource on your OpenShift cluster, you can configure the Red Hat OpenShift AI Operator to install KServe and its dependencies. You can include Authorino as part of the automated installation process. For more information about automated installation, including Authorino, see Configuring automated installation of KServe.
- Manual installation
If you have already created a ServiceMeshControlPlane or KNativeServing resource on your OpenShift cluster, you cannot configure the Red Hat OpenShift AI Operator to install KServe and its dependencies. In this situation, you must install KServe manually. You must also manually configure Authorino. For more information about manual installation, including Authorino, see Manually installing KServe.
3.5. Deploying models by using the single-model serving platform
On the single-model serving platform, each model is deployed on its own model server. This helps you to deploy, monitor, scale, and maintain large models that require increased resources.
If you want to use the single-model serving platform to deploy a model from S3-compatible storage that uses a self-signed SSL certificate, you must install a certificate authority (CA) bundle on your OpenShift cluster. For more information, see Working with certificates (OpenShift AI Self-Managed) or Working with certificates (OpenShift AI Self-Managed in a disconnected environment).
3.5.1. Enabling the single-model serving platform
When you have installed KServe, you can use the Red Hat OpenShift AI dashboard to enable the single-model serving platform. You can also use the dashboard to enable model-serving runtimes for the platform.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- If you are using specialized OpenShift AI groups, you are part of the admin group (for example, rhoai-admins) in OpenShift.
- You have installed KServe.
- Your cluster administrator has not edited the OpenShift AI dashboard configuration to disable the ability to select the single-model serving platform, which uses the KServe component. For more information, see Dashboard configuration options.
Procedure
Enable the single-model serving platform as follows:
- In the left menu, click Settings → Cluster settings.
- Locate the Model serving platforms section.
- To enable the single-model serving platform for projects, select the Single-model serving platform checkbox.
- Click Save changes.
Enable pre-installed runtimes for the single-model serving platform as follows:
In the left menu of the OpenShift AI dashboard, click Settings → Serving runtimes.
The Serving runtimes page shows any custom runtimes that you have added, as well as the following pre-installed runtimes:
- Caikit TGIS ServingRuntime for KServe
- Caikit Standalone ServingRuntime for KServe
- OpenVINO Model Server
- TGIS Standalone ServingRuntime for KServe
- vLLM ServingRuntime for KServe
Set the runtime that you want to use to Enabled.
The single-model serving platform is now available for model deployments.
3.5.2. Adding a custom model-serving runtime for the single-model serving platform
A model-serving runtime adds support for a specified set of model frameworks and the model formats supported by those frameworks. You can use the pre-installed runtimes that are included with OpenShift AI. You can also add your own custom runtimes if the default runtimes do not meet your needs. For example, if the TGIS runtime does not support a model format that is supported by Hugging Face Text Generation Inference (TGI), you can create a custom runtime to add support for the model.
As an administrator, you can use the OpenShift AI interface to add and enable a custom model-serving runtime. You can then choose the custom runtime when you deploy a model on the single-model serving platform.
OpenShift AI enables you to add your own custom runtimes, but does not support the runtimes themselves. You are responsible for correctly configuring and maintaining custom runtimes. You are also responsible for ensuring that you are licensed to use any custom runtimes that you add.
Prerequisites
- You have logged in to OpenShift AI as an administrator.
- You have built your custom runtime and added the image to a container image repository such as Quay.
Procedure
From the OpenShift AI dashboard, click Settings > Serving runtimes.
The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
To add a custom runtime, choose one of the following options:
- To start with an existing runtime (for example, TGIS Standalone ServingRuntime for KServe), click the action menu (⋮) next to the existing runtime and then click Duplicate.
- To add a new custom runtime, click Add serving runtime.
- In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
- In the Select the API protocol this runtime supports list, select REST or gRPC.
Optional: If you started a new runtime (rather than duplicating an existing one), add your code by choosing one of the following options:
Upload a YAML file
- Click Upload files.
In the file browser, select a YAML file on your computer.
The embedded YAML editor opens and shows the contents of the file that you uploaded.
Enter YAML code directly in the editor
- Click Start from scratch.
- Enter or paste YAML code directly in the embedded editor.
Note: In many cases, creating a custom runtime requires adding new or custom parameters to the env section of the ServingRuntime specification (see the example at the end of this section).
Click Add.
The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the custom runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.
- Optional: To edit your custom runtime, click the action menu (⋮) and select Edit.
Verification
- The custom model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.
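For reference, the following is a minimal, hypothetical sketch of a custom ServingRuntime specification that adds a runtime-specific parameter to the env section, as mentioned in the note earlier in this procedure. The runtime name, image, port, and environment variable are placeholders rather than values defined by OpenShift AI:
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: custom-runtime-example          # placeholder name
spec:
  supportedModelFormats:
    - name: <model_format>              # model format that your runtime serves
      autoSelect: true
  containers:
    - name: kserve-container
      image: quay.io/<your_organization>/<your_runtime_image>:<tag>   # placeholder image
      env:
        - name: <CUSTOM_PARAMETER>      # custom parameter required by your runtime
          value: "<value>"
      ports:
        - containerPort: 8080
          protocol: TCP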
3.5.3. Deploying models on the single-model serving platform
When you have enabled the single-model serving platform, you can enable a pre-installed or custom model-serving runtime and start to deploy models on the platform.
Text Generation Inference Server (TGIS) is based on an early fork of Hugging Face TGI. Red Hat will continue to develop the standalone TGIS runtime to support TGI models. If a model does not work in the current version of OpenShift AI, support might be added in a future version. In the meantime, you can also add your own custom runtime to support a TGI model. For more information, see Adding a custom model-serving runtime for the single-model serving platform.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- If you are using specialized OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
- You have installed KServe.
- You have enabled the single-model serving platform.
- You have created a data science project.
- You have access to S3-compatible object storage.
- For the model that you want to deploy, you know the associated folder path in your S3-compatible object storage bucket.
- To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see Converting Hugging Face Hub models to Caikit format in the caikit-tgis-serving repository.
- If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in OpenShift AI. See Enabling GPU support in OpenShift AI.
- To use the vLLM runtime, you have enabled GPU support in OpenShift AI and have installed and configured the Node Feature Discovery operator on your cluster. For more information, see Installing the Node Feature Discovery operator and Enabling GPU support in OpenShift AI
Procedure
In the left menu, click Data Science Projects.
The Data Science Projects page opens.
Click the name of the project that you want to deploy a model in.
A project details page opens.
- Click the Models tab.
Perform one of the following actions:
- If you see a Single-model serving platform tile, click Deploy model on the tile.
- If you do not see any tiles, click the Deploy model button.
The Deploy model dialog opens.
- In the Model name field, enter a unique name for the model that you are deploying.
- In the Serving runtime field, select an enabled runtime.
- From the Model framework list, select a value.
- In the Number of model replicas to deploy field, specify a value.
- From the Model server size list, select a value.
To require token authorization for inference requests to the deployed model, perform the following actions:
- Select Require token authorization.
- In the Service account name field, enter the service account name that the token will be generated for.
To specify the location of your model, perform one of the following sets of actions:
To use an existing data connection
- Select Existing data connection.
- From the Name list, select a data connection that you previously defined.
In the Path field, enter the folder path that contains the model in your specified data source.
Important: The OpenVINO Model Server runtime has specific requirements for how you specify the model path. For more information, see known issue RHOAIENG-3025 in the OpenShift AI release notes.
To use a new data connection
- To define a new data connection that your model can access, select New data connection.
- In the Name field, enter a unique name for the data connection.
- In the Access key field, enter the access key ID for your S3-compatible object storage provider.
- In the Secret key field, enter the secret access key for the S3-compatible object storage account that you specified.
- In the Endpoint field, enter the endpoint of your S3-compatible object storage bucket.
- In the Region field, enter the default region of your S3-compatible object storage account.
- In the Bucket field, enter the name of your S3-compatible object storage bucket.
In the Path field, enter the folder path in your S3-compatible object storage that contains your data file.
Important: The OpenVINO Model Server runtime has specific requirements for how you specify the model path. For more information, see known issue RHOAIENG-3025 in the OpenShift AI release notes.
- Click Deploy.
Verification
- Confirm that the deployed model is shown in the Models tab for the project, and on the Model Serving page of the dashboard with a checkmark in the Status column.
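In addition to the dashboard checks, you can optionally verify the deployment from the OpenShift CLI. KServe represents each deployed model as an InferenceService resource in the project namespace, so a check similar to the following sketch should show the model with READY set to True after a successful deployment:
$ oc get inferenceservice -n <project_namespace>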
3.6. Making inference requests to models deployed on the single-model serving platform
When you deploy a model by using the single-model serving platform, the model is available as a service that you can access using API requests. This enables you to return predictions based on data inputs. To use API requests to interact with your deployed model, you must know the inference endpoint for the model.
In addition, if you secured your inference endpoint by enabling token authorization, you must know how to access your authorization token so that you can specify this in your inference requests.
3.6.1. Accessing the authorization token for a deployed model
If you secured your model inference endpoint by enabling token authorization, you must know how to access your authorization token so that you can specify it in your inference requests.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- If you are using specialized OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
- You have deployed a model by using the single-model serving platform.
Procedure
From the OpenShift AI dashboard, click Data Science Projects.
The Data Science Projects page opens.
Click the name of the project that contains your deployed model.
A project details page opens.
- Click the Models tab.
In the Models and model servers list, expand the section for your model.
Your authorization token is shown in the Token authorization section, in the Token secret field.
- Optional: To copy the authorization token for use in an inference request, click the Copy button next to the token value.
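If you prefer to work from the CLI, the token value is stored in the secret that is named in the Token secret field. The following is a hedged sketch that assumes a service account token secret with a token data key in your project namespace:
$ oc get secret <token_secret_name> -n <project_namespace> -o jsonpath='{.data.token}' | base64 -d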
3.6.2. Accessing the inference endpoint for a deployed model
To make inference requests to your deployed model, you must know how to access the inference endpoint that is available.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- If you are using specialized OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
- You have deployed a model by using the single-model serving platform.
- If you enabled token authorization for your deployed model, you have the associated token value.
Procedure
From the OpenShift AI dashboard, click Model Serving.
The inference endpoint for the model is shown in the Inference endpoint field.
Depending on what action you want to perform with the model (and if the model supports that action), copy the inference endpoint shown and then add one of the following paths to the end of the URL:
Caikit TGIS ServingRuntime for KServe
- :443/api/v1/task/text-generation
- :443/api/v1/task/server-streaming-text-generation
Caikit Standalone ServingRuntime for KServe
REST endpoints
- /api/v1/task/embedding
- /api/v1/task/embedding-tasks
- /api/v1/task/sentence-similarity
- /api/v1/task/sentence-similarity-tasks
- /api/v1/task/rerank
- /api/v1/task/rerank-tasks
gRPC endpoints
- :443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict
- :443 caikit.runtime.Nlp.NlpService/EmbeddingTasksPredict
- :443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTaskPredict
- :443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTasksPredict
- :443 caikit.runtime.Nlp.NlpService/RerankTaskPredict
- :443 caikit.runtime.Nlp.NlpService/RerankTasksPredict
Note: By default, the Caikit Standalone Runtime exposes REST endpoints for use. To use gRPC protocol, manually deploy a custom Caikit Standalone ServingRuntime. For more information, see Adding a custom model-serving runtime for the single-model serving platform.
An example manifest is available in the caikit-tgis-serving GitHub repository.
TGIS Standalone ServingRuntime for KServe
- :443 fmaas.GenerationService/Generate
- :443 fmaas.GenerationService/GenerateStream
Note: To query the endpoint for the TGIS standalone runtime, you must also download the files in the proto directory of the Open Data Hub text-generation-inference repository.
OpenVINO Model Server
- /v2/models/<model-name>/infer
vLLM ServingRuntime for KServe
- :443/version
- :443/docs
- :443/v1/models
- :443/v1/chat/completions
- :443/v1/completions
- :443/v1/embeddings
Note: The vLLM runtime is compatible with the OpenAI REST API. For a list of models that the vLLM runtime supports, see Supported models.
Note: To use the embeddings inference endpoint in vLLM, you must use an embeddings model that vLLM supports. You cannot use the embeddings endpoint with generative models. For more information, see Supported embeddings models in vLLM.
As indicated by the paths shown, the single-model serving platform uses the HTTPS port of your OpenShift router (usually port 443) to serve external API requests.
- Use the endpoint to make API requests to your deployed model, as shown in the following example commands:
Note: If you enabled token authorization when deploying the model, add the Authorization header and specify a token value.
Caikit TGIS ServingRuntime for KServe
curl --json '{"model_id": "<model_name>", "inputs": "<text>"}' https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation -H 'Authorization: Bearer <token>'
Caikit Standalone ServingRuntime for KServe
REST
curl -H 'Content-Type: application/json' -d '{"inputs": "<text>", "model_id": "<model_id>"}' <inference_endpoint_url>/api/v1/task/embedding -H 'Authorization: Bearer <token>'
gRPC
grpcurl -insecure -d '{"text": "<text>"}' -H 'mm-model-id: <model_id>' <inference_endpoint_url>:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict -H 'Authorization: Bearer <token>'
TGIS Standalone ServingRuntime for KServe
grpcurl -proto text-generation-inference/proto/generation.proto -d '{"requests": [{"text":"<text>"}]}' -H 'Authorization: Bearer <token>' -insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate
OpenVINO Model Server
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'
vLLM ServingRuntime for KServe
curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -d '{ "messages": [{ "role": "<role>", "content": "<content>" }] }' -H 'Authorization: Bearer <token>'
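Because the vLLM runtime is compatible with the OpenAI REST API, you can also discover the served model ID from the /v1/models endpoint and reuse it in a completion request. The following is a sketch only; it assumes that jq is installed, that token authorization is enabled, and that the placeholder values are replaced with your own:

# Read the first served model ID from the OpenAI-compatible model list
MODEL_ID=$(curl -s https://<inference_endpoint_url>:443/v1/models \
  -H 'Authorization: Bearer <token>' | jq -r '.data[0].id')

# Request a completion for that model ID
curl -s https://<inference_endpoint_url>:443/v1/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <token>' \
  -d "{\"model\": \"${MODEL_ID}\", \"prompt\": \"<text>\", \"max_tokens\": 50}"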
3.7. Configuring monitoring for the single-model serving platform
The single-model serving platform includes metrics for supported runtimes of the KServe component. KServe does not generate its own metrics and relies on the underlying model-serving runtimes to provide them. The set of available metrics for a deployed model depends on its model-serving runtime.
In addition to runtime metrics for KServe, you can also configure monitoring for OpenShift Service Mesh. The OpenShift Service Mesh metrics help you to understand dependencies and traffic flow between components in the mesh.
Prerequisites
- You have cluster administrator privileges for your OpenShift cluster.
- You have created OpenShift Service Mesh and Knative Serving instances and installed KServe.
- You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.
- You are familiar with creating a config map for monitoring a user-defined workflow. You will perform similar steps in this procedure.
- You are familiar with enabling monitoring for user-defined projects in OpenShift. You will perform similar steps in this procedure.
- You have assigned the monitoring-rules-view role to users that will monitor metrics.
Procedure
In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:
$ oc login <openshift_cluster_url> -u <admin_username> -p <password>
Define a ConfigMap object in a YAML file called uwm-cm-conf.yaml with the following contents:

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      logLevel: debug
      retention: 15d
The user-workload-monitoring-config object configures the components that monitor user-defined projects. Observe that the retention time is set to the recommended value of 15 days.
Apply the configuration to create the user-workload-monitoring-config object:

$ oc apply -f uwm-cm-conf.yaml
Define another ConfigMap object in a YAML file called uwm-cm-enable.yaml with the following contents:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
The cluster-monitoring-config object enables monitoring for user-defined projects.
Apply the configuration to create the cluster-monitoring-config object:

$ oc apply -f uwm-cm-enable.yaml
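After you apply both config maps, you can optionally confirm that the monitoring components for user-defined projects have started. This is one way to check; the exact pod names can vary by OpenShift version:

# Pods such as prometheus-user-workload and thanos-ruler should reach the Running state
$ oc get pods -n openshift-user-workload-monitoring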
Create ServiceMonitor and PodMonitor objects to monitor metrics in the service mesh control plane as follows:
Create an istiod-monitor.yaml YAML file with the following contents:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istiod-monitor
  namespace: istio-system
spec:
  targetLabels:
  - app
  selector:
    matchLabels:
      istio: pilot
  endpoints:
  - port: http-monitoring
    interval: 30s
Deploy the ServiceMonitor CR in the specified istio-system namespace:

$ oc apply -f istiod-monitor.yaml
You see the following output:
servicemonitor.monitoring.coreos.com/istiod-monitor created
Create an istio-proxies-monitor.yaml YAML file with the following contents:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: istio-proxies-monitor
  namespace: istio-system
spec:
  selector:
    matchExpressions:
    - key: istio-prometheus-ignore
      operator: DoesNotExist
  podMetricsEndpoints:
  - path: /stats/prometheus
    interval: 30s
Deploy the PodMonitor CR in the specified istio-system namespace:

$ oc apply -f istio-proxies-monitor.yaml
You see the following output:
podmonitor.monitoring.coreos.com/istio-proxies-monitor created
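As an optional check, confirm that both monitor objects now exist in the istio-system namespace:

# Both resources should be listed without errors
$ oc get servicemonitor istiod-monitor -n istio-system
$ oc get podmonitor istio-proxies-monitor -n istio-system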
3.8. Viewing model-serving runtime metrics for the single-model serving platform
When a cluster administrator has configured monitoring for the single-model serving platform, non-admin users can use the OpenShift web console to view model-serving runtime metrics for the KServe component.
Prerequisites
- A cluster administrator has configured monitoring for the single-model serving platform.
- You have been assigned the monitoring-rules-view role.
- You are familiar with how to monitor project metrics in the OpenShift web console.
Procedure
- Log in to the OpenShift web console.
- Switch to the Developer perspective.
- In the left menu, click Observe.
As described in monitoring project metrics, use the web console to run queries for caikit_*, tgi_*, ovms_*, and vllm:* model-serving runtime metrics. You can also run queries for istio_* metrics that are related to OpenShift Service Mesh. Some examples are shown below.
The following query displays the number of successful inference requests over a period of time for a model deployed with the vLLM runtime:
sum(increase(vllm:request_success_total{namespace=${namespace},model_name=${model_name}}[${rate_interval}]))
The following query displays the number of successful inference requests over a period of time for a model deployed with the standalone TGIS runtime:
sum(increase(tgi_request_success{namespace=${namespace}, pod=~${model_name}-predictor-.*}[${rate_interval}]))
The following query displays the number of successful inference requests over a period of time for a model deployed with the Caikit Standalone runtime:
sum(increase(predict_rpc_count_total{namespace=${namespace},code=OK,model_id=${model_name}}[${rate_interval}]))
The following query displays the number of successful inference requests over a period of time for a model deployed with the OpenVINO Model Server runtime:
sum(increase(ovms_requests_success{namespace=${namespace},name=${model_name}}[${rate_interval}]))
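If you prefer to run these queries from the command line rather than the web console, one option is to send them to the Thanos Querier API that OpenShift monitoring exposes. The following is a sketch under the assumption that the thanos-querier route exists in the openshift-monitoring namespace and that your token can read metrics for the target project:

# Resolve the Thanos Querier route and run a PromQL query for a vLLM metric
THANOS_HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
TOKEN=$(oc whoami -t)

curl -sk -H "Authorization: Bearer ${TOKEN}" \
  --data-urlencode 'query=sum(vllm:request_success_total{namespace="<project_name>"})' \
  "https://${THANOS_HOST}/api/v1/query"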
3.9. Performance tuning on the single-model serving platform
Certain performance issues might require you to tune the parameters of your inference service or model-serving runtime.
3.9.1. Resolving CUDA out-of-memory errors
In certain cases, depending on the model and hardware accelerator used, the TGIS memory auto-tuning algorithm might underestimate the amount of GPU memory needed to process long sequences. This miscalculation can lead to Compute Unified Device Architecture (CUDA) out-of-memory (OOM) error responses from the model server. In such cases, you must update or add parameters in the TGIS model-serving runtime, as described in the following procedure.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- If you are using specialized OpenShift AI groups, you are part of the admin group (for example, rhoai-admins) in OpenShift.
Procedure
From the OpenShift AI dashboard, click Settings > Serving runtimes.
The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
Based on the runtime that you used to deploy your model, perform one of the following actions:
- If you used the pre-installed TGIS Standalone ServingRuntime for KServe runtime, duplicate the runtime to create a custom version and then follow the remainder of this procedure. For more information about duplicating the pre-installed TGIS runtime, see Adding a custom model-serving runtime for the single-model serving platform.
- If you were already using a custom TGIS runtime, click the action menu (⋮) next to the runtime and select Edit.
The embedded YAML editor opens and shows the contents of the custom model-serving runtime.
Add or update the BATCH_SAFETY_MARGIN environment variable and set the value to 30. Similarly, add or update the ESTIMATE_MEMORY_BATCH_SIZE environment variable and set the value to 8.

spec:
  containers:
    env:
      - name: BATCH_SAFETY_MARGIN
        value: 30
      - name: ESTIMATE_MEMORY_BATCH_SIZE
        value: 8
Note: The BATCH_SAFETY_MARGIN parameter sets a percentage of free GPU memory to hold back as a safety margin to avoid OOM conditions. The default value of BATCH_SAFETY_MARGIN is 20. The ESTIMATE_MEMORY_BATCH_SIZE parameter sets the batch size used in the memory auto-tuning algorithm. The default value of ESTIMATE_MEMORY_BATCH_SIZE is 16.
Click Update.
The Serving runtimes page opens and shows the list of runtimes that are installed. Observe that the custom model-serving runtime you updated is shown.
To redeploy the model for the parameter updates to take effect, perform the following actions:
- From the OpenShift AI dashboard, click Model Serving > Deployed Models.
- Find the model you want to redeploy, click the action menu (⋮) next to the model, and select Delete.
- Redeploy the model as described in Deploying models on the single-model serving platform.
Verification
- You receive successful responses from the model server and no longer see CUDA OOM errors.
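In addition, you can optionally confirm that the updated environment variables are present on the redeployed predictor pod. This is a sketch only; the serving.kserve.io/inferenceservice label selector and the placeholder names are assumptions for illustration:

# Show the environment variables of the first container in the predictor pod
$ oc get pods -n <project_name> -l serving.kserve.io/inferenceservice=<model_name> \
    -o jsonpath='{.items[0].spec.containers[0].env}'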